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Chapter 1 

Introduction 



A large class of dense linear algebra operations operations, such as LU decomposition or inversion 
of a triangular matrix, are usually performed by blocked algorithms. For one such operation, 
typically, not only one but many algorithmic variants exist; depending on architecture, libraries, 
and problem size, each variant attains a different performances. Our goal is to rank the algorithmic 
variants according to their performance for a given scenario without executing them. 

For this purpose, we analyze the routines upon which the algorithms are built and introduce a 
tool that, based on measurements, models their performance. The generated performance models 
are then used to predict the performance of the considered algorithmic variants. For a given 
scenario, these predictions allow us not only to rank the variants but to determine the optimal 
algorithmic block-size. 

The strength of our approach mainly originates from the performance models. Generated once 
for a given system, they can be used to analyze any number of blocked algorithms. Besides reliably 
predicting and ranking algorithm performance, they yield further insight into how the performance 
is influenced by certain factors such as the block-size. 

1.1 Motivating Example 

We consider an exemplary dense linear algebra operation: the inversion of a lower triangular matrix, 
L 4— L^ 1 . For this operation, there exist four different blocked algorithms (see Section 1.4). For 
now, all we need to understand is that all of them take a lower triangular matrix L € M™ x ™ as an 
input and compute its inverse in place; they accept only one additional algorithm parameter: the 
algorithmic block-size b. 

We use the following setup to analyze the four algorithms: 

• Their implementation (see Appendix B.l for the source code) is compiled with Intel's C 
Compiler (ice) version 12.0 [4]. 

• They are executed on one core of an Intel Harpertown E5450 processor [6] running at 
2.99GHz. This processor can issue 2 double precision floating point operations 1 per clock 
cycle. Therefore, it can perform up to peak _flops/s = 2 x 2.99 TO 9 floating point operations 
per second. 

1 We consider a fused multiply add operation (FMA) d <— a ■ b + c to be one floating point operation. It is the 
core of any dense matrix operation. 
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Figure 1.1: Inversion of a triangular matrix: execution time and efficiency. 



• The Intel's Math Kernel Library (MKL) [5] is used for the underlying Basic Linear Algebra 
Subroutines (BLAS) [20, 13]. 

Figure 1.1 shows the execution time and the efficiency 2 of the four algorithm variants. In 
Figures 1.1a and 1.1b, the algorithmic block-size is fixed to 96 and the matrix size n is varied. 
The results show significant differences in performance between algorithms: Variant 4 ( ) takes 
more than twice as long compared to the other three variants; it reaches a maximum efficiency of 
29%. Variant 1 (•) is also slower than variants 2 (•) and 3 (•) and attains an efficiency of 73%. 
Variants 2 (•) and 3 (•) seem equally good for matrices up to size n = 512. For larger matrices, 
variant 2 (•) becomes the fastest; ultimately, variants 2 (•) and 3 (•) reach efficiencies of 81% and 
86%, respectively. 

In Figures 1.1c and l.ld, the matrix size is fixed to n = 1000 and only the block-size varies. 



2 Section 2.1.1 contains details on the computation of efficiency. 
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For all variants, the efficiency decreases for very small and large block-sizes. Variants 1 (•), 2 (•), 
and 3 (•) reach their peak efficiency for block-sizes close to 100. 

This example shows that in order to reach high efficiency it is crucial to both choose the 
algorithmic variant as well as optimizing the block-size. Due to the complexity of the architecture 
and the memory access patterns, it is not possible to determine the optimal configuration by 
only analyzing the algorithms mathematically. On the contrary, the best choice depends on the 
matrix size, the underlying computational kernels, such as BLAS 3 , and the processor architecture; 
changing these may lead to entirely different performance behavior. 

In this thesis, we introduce tools and methods to analyze and model the performance of dense 
linear algebra kernels. These tools allow us to perform the challenging task of ranking algorithmic 
variants according to their performance and determining the optimal block size. Results for the 
example at hand are given in Section 4.2. 

1.2 Related Work 

Several different approaches of using performance modeling in dense linear algebra already exist. 

Iakymchuk et al. [19] model the performance of BLAS [20, 13] analytically based on memory 
access patterns. While their models represent the program execution very accurately, constructing 
them requires a high level of expertise on both the routines and the architecture. 

Cuenca et al. [12] develop a self-optimizing linear algebra routines (SOLAR). In their system, 
every routine is associated with performance information, which is hierarchically propagated to 
higher level routines (e.g., BLAS — > LAPACK — > ScaLAPACK); on each level, the information is 
used to tune the routines and associate according performance information. 

Dongarra et al. [14] propose a modeling approach targeted at programs such as High Per- 
formance LINPACK (HPL) [22] and ScaLAPACK [11, 9] . They employ sampling and polynomial 
fitting to construct their models and use them to extrapolate the performance of routines for larger 
problems and higher parallelism. 

1.3 Outline of this Thesis 

This thesis is structured as follows: 

• In Section 1.4, we give an introduction to blocked algorithms, the target of our predictions. 

• In Chapter 2, we discuss the Sampler, a tool that measures different performance metrics 
during the execution of dense linear algebra routines. 

• In Chapter 3, we introduce the Modeler, a framework that, based on the Sampler, generates 
analytical performance models for dense linear algebra routines. 

• In Chapter 4, we use the performance models generated by the Modeler to predict the 
performance of blocked algorithms and rank them accordingly. 

• In Chapter 5, we conclude with a summary of our achievements and an outlook on possible 
future research directions based on this thesis. 



3 An introduction to BLAS is given in Appendix A. 
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Figure 1.2: Triangular Inverse — traversal of L. 



1.4 Blocked Algorithms 

Since our goal is to rank blocked algorithms, in this section, we introduce their structure. We 
begin by studying blocked algorithms for a specific operation, the inversion of a triangular matrix 
(1.4.1), and then generalize the concepts to arbitrary operations (Section 1.4.2). 



1.4.1 Triangular Inverse L L 
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In Section 1.1, we discussed the performance of four blocked algorithms for the inversion of a 
lower triangular matrix. The algorithms take a lower triangular matrix L £ M. nxn as an input and 
compute its inverse in place. 

Within a blocked algorithm, the matrix L is seen in a partitioned form 



L 



Ltl 
Lbl 





Lbr 



where L TL e M pxp , L BL e 



pgxp 



and Lbl G M. qxq with p + q = n. Initially, p is 0, or equivalently, 



Ltl and Lbl are "empty". L is then traversed from the top left corner along its diagonal 



L 



in steps of the algorithmic block-size, increasing p up to n. During the traversal, the inverse of L 
is computed in Ltl] this part of the matrix grows in size until p = n. At this point, Ltl is of the 
size n x n and contains the inverse of the original matrix L; the algorithm terminates. 

Traversing L not element-wise but in blocks of size b has the advantage that BLAS Level-3 
operations, such as dgemm, can be used to attain high performance. 

At each step of the matrix traversal (depicted in Figure 1.2), L is repartitioned as 



Ltl 





Lbl 


Lbr 



Lqo 








L\o 


Ln 





L20 


L21 


L22 



with Ln £ R bx , L22 € W xr (with r = n— p — b), and conforming sizes for the other submatrices. 
(When n is not divisible by b, b is adjusted to b 4— n — p in the last step.) 

At this point, the four algorithms perform different updates on the 3x3 partitioned form of 
the matrix: 
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All but the last update statement in each algorithm map directly to one of the BLAS routines 
dtrsm, dtrmm, and dgemm (see Appendix A for an introduction to BLAS). The last update on the 
other hand is the inversion of Ln £ K bxf> — a recursive call to the inversion of a triangular matrix 
with a smaller input matrix of size 6x6. This recursive invocation, which is typical for blocked 
algorithms, is performed by an unblocked version of the algorithm (that is block-size 6=1). In 
this version, the last update is the scalar operation Ln = ^— £ R. 

Once the updates have been performed, the 3x3 partitioning of the matrix L is merged back 
into four quadrants 
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and p is incremented by 6. Unless p = n, the matrix is now again partitioned into 
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and the matrix traversal proceeds. When p = n, the algorithm terminates and L 1 has been 
computed. 

1.4.2 Generalization 

In the previous section, we presented four blocked algorithms for the inversion of a lower triangular 
matrix. We now generalize the principle of blocked algorithms step by step. 

Full matrix. Extending blocked algorithms to non-triangular matrices is straightforward. As- 
suming the same direction of traversal A, , an input matrix A is partitioned as follows: 
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The corresponding update and repartitioning are shown in Figure 1.3. 

Traversal directions. Until now, we considered the case where the input matrix A is traversed 



from the top left corner along its diagonal 



'A 



In principle, A can be traversed in any direction: 
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Figure 1.3: Blocked algorithms — update and repartitioning for a square matrix. 
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matrix diagonally, it is partitioned into 2x2 and 3 x 3 as shown above. In the other cases, the 
matrix is partitioned into two and then three matrices 
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A T 
A b 



A Q 

I Ai ] or A -> ( A L A R ) ( A A Y A 2 ) 
A 2 



and traversed in blocks of size b. 



Multiple matrices. When a blocked algorithm operates on multiple matrices, each of them can 
potentially be traversed along different directions. The block-size 6, however, is constant across all 
matrices. 

Non-square matrices. When non-square matrices are traversed horizontally or vertically, noth- 
ing changes compared to the square case. When, on the other hand, a non-square matrix is 
traversed diagonally, there are two alternatives: 

• Using two different block-sizes, leading to rectangular blocks on the diagonal, or 

• Traversing diagonally as far as possible and continuing along the remaining vertical or hori- 
zontal direction. 

We consider the latter. 

Without loss of generality, we assume that a matrix A € 



the top left corner 



A* 



mxn with m > n is traversed from 
The traversal of A is identical to the square case until the rightmost 



column is reached. At this point, the matrix is partitioned as follows: 

A 

where A TL G R mxm and A BL e rC"-™)*™; botn a tr and a br lmve a width of q co i umns . 



Atl Atr 
Abl Abr 
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Figure 1.4: Blocked algorithms — update and repartitioning for a non-square matrix. 



From this point on, the matrix is repartitioned, where all submatrices that originate from Atr 
and Atl have columns: 



.4 



.4 



TL 



A 



BL 




The new submatrices are of size A ao € R pxn , A w £ R bxn , and A 20 € R(™-P- b ) x ™ (see Figure 1.4). 
The updates are applied to these submatrices as usual. Those that involve empty matrices do not 
have any effect. 

At the end of the iteration, the matrix is partitioned as 




A 



TL 



A 



A B l 

and we update p -s— p + b. The algorithm terminates, once p — m (and inherently p > n). 
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Chapter 2 

Sampling 



In this section, we describe the construction of a tool for performance measurements of dense linear 
algebra (DLA) routine executions: The Sampler. 

We start by specifying our goal and the desired functionality of the Sampler (Section 2.1). 
Subsequently, we conduct a set of experiments that will aid us in the design and implementation 
of the Sampler (Section 2.2). In Section 2.3, we introduce the Sampler and discuss its design and 
interface. 

2.1 The Goal 

Our goal is to design a performance measurement tool that can be used as a basis for performance 
modeling. In the following, we will specify the Sampler's desired functionality, discussing 

• the performance metrics that are available and of interest to us (Section 2.1.1), 

• the interface used to specify measurement requests (Section 2.1.2), and 

• the configuration of the environment for the measurements (Section 2.1.3). 

2.1.1 Performance Metrics 

Before designing our performance measurement tool, we first need to clarify and structure our 
understanding of performance. 

We consider the performance of a routine execution to be a set of performance metrics. These 
describe several aspects of the execution and events occurring during its runtime. We divide 
performance metrics into two types: 

• Intrinsic performance metrics — henceforth called performance counters — are available 
through special CPU registers; 

• Derived performance metrics are computed from performance counters and further informa- 
tion on the execution environment. 

The time stamp counter, the most fundamental performance counter, is a register that is 
incremented once per CPU cycle. This register's value can be accessed directly through the x86 
instruction RDTSC (read time stamp counter). From now on, we use this highly accurate measure 
for execution time; the corresponding performance counter is referred to as ticks. 
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In order to measure further hardware events, each CPU offers between 2 and 5 hardware coun- 
ters that can be configured to count certain types of events. We use the Performance Application 
Programming Interface version 4.2.1.0 (PAPI) [21] to access these event counter. 

PAPI provides functions to configure, start, and read the counters. It supports up to 107 
different events; usually only a subset of these are available on a particular CPU. These events 
include, but are not limited to: 

Cache access events. For each cache level, PAPI can measure the number of cache reads, writes, 
hits, misses, and total accesses. Data and instruction caches can be treated separately ore 
combined. On some CPUs, PAPI can distinguish between cache misses occurring during 
load and store instructions. On multicore CPUs, the occurrences of cache coherency related 
events such as accesses to shared cache lines or cache line invalidations can be counted through 
PAPI. 

We denote the number of Level- 1 and Level-2 cache misses by the performance counters 
Limisses and L2misses, respectively. 

Translation Lookaside Buffer (TLB) events. PAPI provides counters for the number of TLB 
misses; data and instruction TLBs can again be treated separately or combined. We call the 
performance counter representing the total number of TLB misses TLBmisses. 

Instruction events. PAPI can count the number of instructions of specific types, including: load, 
store, branch, and floating point instructions (with further subdivisions). The performance 
counter for the number of floating point operations is subsequently called flops 1 . Further, 
the number of CPU cycles that specific instruction units are idle or stalling due to memory 
accesses can be measured separately. 

The complete list of event counters and those which are available on a specific CPU can be accessed 
through the shell command papi_avail. 

Derived performance metrics are computed from the performance counter and (possibly) ad- 
ditional information, for instance the CPU frequency or cache sizes. A list of common derived 
performance metrics is given below: 

• Floating point operations per second can be computed from flops, ticks, the CPU's frequency 
hz and the CPU's available floating point instructions per cycle fpipc: 

flops x hz 

flops/s — 



ticks x fpipc 

• The efficiency of a routine execution is computed from ticks, fpipc, and the number of 
mathematical operations performed mops 2 : 

,,. . mops 

eft iciencu = . 

ticks x fpipc 

• Level- 1 cache miss ratio is obtained from Limisses and the number of Level- 1 cache accesses 
Liaccesses: 

Liaccesses 

Li missratio — — . 

Limisses 

For the sampling tool, we only consider the directly measurable performance counters. 



1 flops is the number of floating point operations. The number of floating point operations per second is 
consistently denoted by flops/s. The property of a certain CPU that determines the maximum number of floating 
point operations available per second is referred to as peak f lops/s. 

2 Routines often perform more flops than an operation requires from a mathematical point of view. 
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2.1.2 Routines and Arguments 

We are interested in sampling dense linear algebra routines, such as BLAS or unblocked algorithms. 

When sampling a routine, we need to specify both its symbol and its arguments. Let us consider 
the following BLAS call: 

side uplo transA diag m n alpha A IdA B ldB 

dtrsm( R , L , N , U , 512, 128, 0.37, A, 256, B, 512). 

This invocation performs the operation B ■<— 0.37-EL4 -1 , where A £ jji28xi28 ^ g i ower triangular 
and has leading dimension 256 and B e Jj5i2xi28 w ^h leading dimension 512. We can execute the 
above call with different BLAS implementations. Thus, the routine can only be identified by both 
its name and its implementation. 

Since all BLAS implementations use the same interface, we cannot use several implementations 
in the same program. The simplest possible solution is to generate separate sampling programs, 
each linked with a different BLAS implementations. In each of these programs, a routine is uniquely 
identified by its name. 

All parameters apart form A and B are basic data types such as characters, integers, and floating 
point numbers; A and B on the other hand are matrices. At this point, we can exploit, that most 
dense linear algebra operations are mostly independent of their matrices' values; the floating point 
operations that are performed are solely determined by the other arguments. This means, all we 
need to sample a routine are sufficiently large memory chunks assigned to matrices and vectors 
such as A and B. Therefore, a request to sample the above call to dtrsm can be represented as the 
following tuple: 

(dtrsm, R, L, N, U, 512, 128, 0.37, 256 x 128, 256, 512 x 128, 512). 

Here, the data arguments have been replaced by their sizes in memory: A has leading dimension 
256 and 128 columns. Thus, dtrsm accesses a range of 256 x 128 = 32, 768 double precision floating 
point numbers for A. 

2.1.3 Systems and Environment 

The remaining concern is to control the sampling environment. This consists of the execution 
environment of the Sampler and the preconditions for each sample. The relevant aspects are the 
system architecture and the configuration of the BLAS library, such as support for multithreading. 

From within the sampler, we can influence the data locality of routine arguments, which may 
greatly influence performance. We need a mechanism to control where the routine's arguments 
reside in memory. Experiments in Section 2.2 will give us further understanding regarding the 
influence of memory locality on performance. This in turn will aid us in the design of our tool. 

The last important requirements for the Sampler are minimal overhead and performance dis- 
tortion. Both can be achieved by a clean design and structure. 

2.2 Experiments 

We now consider at a series of preliminary experiments that guide our design of the Sampler. In 
particular, we execute dtrsm to analyze the effects of various setups. 
The system configuration remains unchanged, that is: 

• One core of an Intel Harpertown E5450 processor running at 2.99GHz; 

• Intel's C Compiler (ice). 
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GotoBLAS2 (•) MKL (•) ATLAS (•) 

1 st median 1 st median 1 st median 



ticks 


5, 737, 806 


497, 790 


12,408,516 


451,026 


224, 973,414 


1,081,404 


flops 


611,992 


611,785 


646, 350 


609, 784 


1,217, 632 


1,216,491 


Li misses 


38, 762 


38,428 


98, 996 


28,690 


15, 248 


13,925 


efficiency 


5.30% 


61.1% 


2.45% 


67.4% 


0.135% 


28.1% 



Table 2.1: Initial measurement outliers. 



We use the high performance BLAS implementations GotoBLAS2, MKL, and ATLAS 3 . 

The first experiment (Section 2.2.1) is discussed in greater detail and serves as a reference for 
modifications of the setup (Section 2.2.2). 

2.2.1 Reference Experiment 

In the first experiment, we repeatedly execute 

side uplo transA diag m n alpha A IdA B ldB 
dtrsm( R , L , N , U , 128, 96, 0.37, A, 128, B, 128), 

where A and B are assigned fixed memory locations across all repetitions. 

We measure the performance counters ticks, flops, and Limisses, and compute the derived 
performance metric efficiency. The measurements are performed as follows: 

1. The performance counters are initialized using PAPI; 

2. The routine is executed; 

3. The performance counters are read through PAPI: 

4. The results are written to a file; 

These steps are repeated 1000 times, collecting all results. 

The first execution of each series of experiments reproducibly yields a significantly higher and 
thus less efficient result. Table 2.1 shows these series' first measurements and their medians. Espe- 
cially MKL's first result is very large. These outliers are due to the BLAS implementations, which, 
when called for the first time, perform a number of initializations, such as memory allocations, 
system architecture identification, and algorithm selection. From the observation of the initial 
outliers, we learn to discard the first experiment's result in each program invocation. We do not 
integrate this idea into the Sampler, since we might later use it to analyze exactly this behavior. 
However, we have to keep the outliers in mind when we use the Sampler to measure performance, 
for example, for the construction of performance models in Section 3.2.1. 

All but the first outlier measurement are shown in Figure 2.1. We observe the following: For the 
faster implementations GotoBLAS2 (•) and MKL (•) the measured ticks and, thus, the computed 
eff iciency fluctuate in a range of 8%. flops and Limisses on the other hand are almost constant. 
GotoBLAS2 (•) and MKL (•) attain efficiencies of 61% and 67%, respectively; this relatively low 
performance is due to the small problem size of 96 x 128. 
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Figure 2.1: Repeated execution of dtrsm. 



2.2.2 Modified Setups 

We now modify the experiment setup used in Section 2.2.1 and investigate its influence on the 
measurements. The results of the modifications for the same performance metrics as in the ref- 
erence experiment are shown in Figure 2.2. Each experiment was executed 1000 times and the 
corresponding plots show the series' median. 

(a) Separation of sampling and IO: In this modification the 1000 repeated routine invocations are 
executed consecutively without writing the results to the output. Instead, the results are stored 
in memory and only written to the standard output stream after all 1000 samples were taken. 
As we can see in Figure 2.2, this modification did not affect the performance measurements. 



3 See Section A. 4 for further detail on these implementations. 
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Figure 2.2: Modifications of the experiment setup for dtrsm. 



However, we observe a slight decrease in the total runtime of the experiment series of 0.7 
seconds, or 0.7 milliseconds (2, 137,850 ticks) per sample. 

(b) Random matrix memory locations: In this modification, the routine's matrix arguments are 
randomly taken from a 2GB large memory chunk. Using this configuration, the results differ 
significantly: While flops and Limisses remain unchanged, we experience an increase in 
execution time and decrease in efficiency. This is expected, since the number of floating point 
instructions is unchanged, while the memory accesses are to main memory instead of the CPU's 
caches. Both reusing the same memory locations to gain increased memory locality, as well 
as accessing memory areas which do not reside in the caches — called cache trashing — are 
interesting scenarios. 
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2.3 The Sampler 

With gained insights into the relevant factors for measuring performance, we turn to the Sampler. 
This tool is designed to fulfill the requirements specified in Section 2.1. Section 2.3.1 gives an 
overview of the design and mechanisms used in the Sampler's implementation. We then turn to its 
usage and interface in Section 2.3.2. The tool is employed in a series of experiments in Section 3.1. 

2.3.1 Design 

The Sampler is designed as a flexible lightweight performance measurement tool. It is written in 
C and can, hence, directly interfaces to libraries such as BLAS [20, 13] or LAPACK [10, 7]. 

The used libraries and the list of routines that can be sampled are incorporated into the 
Sampler during building process. Given a list of header files for the routines, specific object 
files are generated. These contain information on the routines signatures necessary to read and 
execute corresponding sampling requests. Linking these object files with the implementations 
of the routines yields the Sampler. The same object files can be linked with different routine 
implementations (e.g., GotoBLAS2 or MKL) to create separate Samplers for each of them. In 
order to sample multithreaded routines, the implementations need to be configured either before 
linking or through the system environment at runtime, according to the libraries configuration 
system. 

We discuss the structure of the Sampler by looking at its program flow. The outline of the 
execution of the sampler is as follows: 

1. Initialization. 

2. While the end of input stream is not reached: 

(a) Read a set of sampling requests and prepare their execution; 

(b) Execute and sample the routines; 

(c) Write the results to the output. 

3. Finalization. 

We will now consider each of these stages in more detail. 

Initialization. The initialization of the Sampler starts by reading a specified configuration file 
(a full list of possible configurations is given in Section 2.3.2.1). In case PAPI is requested, the 
PAPI library is initialized. 

Then, the Sampler sets up the input and output streams. These are used to read sampling 
requests and write sampling results. These streams, which by default are the programs standard 
10 streams, can be redirected to files. 

Next, the Sampler allocates a large contiguous chunk of memory of configurable size. This 
memory is used for the vector and matrix arguments of the sampled routines. It is initialized with 
random double precision numbers between and 1. 

Reading sampling requests. Once the main loop is reached, the Sampler begins to read sam- 
pling requests from the input stream (their syntax is given in Section 2.3.2.2). For each request, 
the Sampler's assigns memory locations within the preallocated memory chunk to sampled rou- 
tine's vector and matrix arguments. Different memory access patterns (such as cache trashing or 
in-cache) can be configured (see Section 2.3.2.1). 

There are three conditions upon which the Sampler stops reading requests: 
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• The (configurable) maximum number of routine executions that are sampled in one block is 
reached; 

• A special command "go" was read from the input stream; 

• The end of the input stream is reached. 

Executing sampling requests. In this step, the Sampler executes all sampling requests read 
in the previous step. The resulting measurements are stored in memory for each execution. 10 
and sampling are separated in order to reduce possible interference and overhead. 

Writing the results to the output. Now, the measurement results are written to the output 
stream (the output format is given in Section 2.3.2). The structures used to handle the sampling 
requests are cleaned such that they can be reused in the next iteration of the main loop. 

Finalization. After the main loop terminates, the Sampler frees the used memory. 
2.3.2 Interface and Usage 

The Sampler's interface is designed to be as clean and flexible as possible; the Sampler can be used 
either as an interactive tool in a shell or by other programs and scripts. The interface consists of 
three major parts: 

• The configuration file (Section 2.3.2.1), 

• The input stream format (Section 2.3.2.2), and 

• The output stream format (Section 2.3.2.3). 

2.3.2.1 Configuration File 

The Sampler is invoked with only one argument — the path to a configuration file: 

sampler <configuration file> 

The configuration file consists of several options, each in an individual line. Lines beginning with 
a # are ignored. The following options are available: 

• input = file: The input is to be read from file. By default the input stream is the 
program's standard input stream. 

• output = file: The sampling results are to be redirected to file. The default is the 
standard output. 

• usepapi = 1 1 0: Whether PAPI is to be used for performance counter sampling (1) or not 
(0). Without PAPI, only ticks is measured. 

• ncounters = n : The number of PAPI performance counters to be used. This number is 
inherently limited by the systems CPU architecture. 

• counters[i] = counter_id: The i-th PAPI performance counter is set to counter_ id — 
any valid PAPI event name. On systems with PAPI, a full list of available event names is 
available through the shell command papi_avail. 
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• maxcalls = n: n sampling requests are to be executed in a block. 

• mem_size = n: The Sampler assigns memory chunks to routine arguments from an allocation 
of n bytes. 

• mem_policy = n: The memory policy is identified by a number between and 3: 

0: Static. Each routine is assigned disjoint memory chunks from the start of the Sampler's 
memory. Since this configuration will assign the same memory locations over and over 
again, this configuration can be used to simulate cache locality. 

1: Forward. Each assigned memory chunk begins where the last ended, such that there is no 
overlap between chunks. Once the end of the Sampler's memory is used, the assignment 
starts over from its beginning. This configuration ensures that there are no overlaps 
between memory chunks in consecutive routine executions. It can therefore be used to 
achieve cache trashing. 

2: Backward. The memory chunks are assigned in backward order, starting from the end 
of the Sampler's memory. The concept of this policy is the same as for the Forward 
mode (1). 

3: Random. The memory chunks are taken randomly from within the Sampler's memory. 
If the Sampler's memory is large enough, the chances of overlapping memory regions 
is minimal, leading to cache trashing. If on the other hand, the memory allocation is 
small, overlapping memory assignments are more likely, even for chunks assigned to the 
same routine execution. 

Varying the memory policy will allow to represent certain scenarios in the estimation of 
performance in the 

• mem_align = n : Every assigned memory chunk will be aligned in memory by blocks of n 
bytes. This configuration may be used to achieve cache alignment, ensuring that each matrix 
and vector starts in a new cache line. 

• state_queries = Oil: When set to 1, the Sampler prints queries to the standard output, 
stating which routine type of arguments are expected (e.g., int* or double*). 

• show_progress = Oil: When set to 1, the Sampler prints its progress to standard output. 

• matlab_output = Oil: Activating this option leads to output in Matlab matrix notation. 
This is useful when Matlab is used to process the sampling results. 

All options that take values or 1 are by default and may be omitted. The options for 
PAPI counters are only valid with a previous usepapi = 1. In this case, the number of specified 
performance counters has to match the number given in ncounters = n . 



2.3.2.2 Input Stream Format 

The format of the input stream agrees with the considerations in Section 2.1.2. A request consists 
of a routine name followed by that routine's argument specifications. The only exception is the 
command go, which may be issued between sampling requests. Upon encountering this command, 
the Sampler immediately starts the sampling process, prints out the results, and only then continues 
to read its input stream. This command is needed when the sampler is used interactively or by 
another program in order to initiate the sampling process. 
A request to sample 



18 



CHAPTER 2. SAMPLING 



side uplo transA diag m n alpha A IdA B ldB 

dtrsm( R , L , N , U , 128, 96, 0.37, A, 128, B, 128), 
for instance is submitted by the following line on the input stream: 

dtrsm R L N U 128 96 v. 37 16384 128 16384 128. 
From the routine's signature, the Sampler know how to interpret the arguments. 

• The capital characters are allowed for char* arguments. 

• int* arguments take an integer value. 

• For float* or double* arguments there are two possibilities: 

— Given an integer, a memory chunk of this size will be assigned to this argument. 

— Given the character v (for "value") followed by a floating point number, the number is 
passed to the routine as a single floating point number. This is useful to cover special 
cases such as 0, 1 or —1, which might trigger separate routine branches. 

2.3.2.3 Output Stream Format 

The output format is very similar to the input format: For each sampling request, the name of the 
routine is printed, followed by its char* and int* arguments 4 . After the arguments, the sampling 
results are written to the output stream, followed by a new-line character. 

The first sampling result is always the performance counter ticks. It is followed by PAPI's 
performance counter results as specified in the configuration file. 

For instance, the output for the request 

dtrsm R L N U 128 96 v0.37 16384 128 16384 128 
with PAPI counters flops and Limisses might be: 

name char* int* ticks flops Limisses 

dtrsm nTO 128 96 128 128" 1079973 1217382 13509 



4 float* and double* arguments are omitted. 



Chapter 3 

Modeling 



In the previous Chapter we discussed the Sampler — a tool for measuring the performance of 
DLA routines. From measurements obtained by the Sampler, we now want to construct analytical 
performance models. For this purpose, we introduce the Modeler, that based on the Sampler, 
generates a certain type of performance models automatically. These models form the base for our 
performance prediction and algorithm ranking (Chapter 4). 
The Modeler is developed in the following design process: 

• We begin by studying the dependence of performance on various arguments of DLA routines 
(Section 3.1). The focus is BLAS, the most fundamental routines. 

• Using the gained understanding on the dependencies, we develop the structure of the perfor- 
mance model that the Modeler shall generate (Section 3.2). 

• With the desired model structure in mind, we develop the Modeler, which generates these 
models automatically (Section 3.3). 

• We conclude by evaluating how well the Modeler fulfills its objective — the reliable generation 
of accurate performance models (Section 3.4). In the process, we illustrate how the Modeler 
can be configured to obtain models with certain properties. 



3.1 Preliminary Experiments 

In the experiments in Section 2.2, we have already determined a few interesting aspects of perfor- 
mance measurements: 

1. The first measurement is always an outlier; 

2. When a routine is executed repeatedly, most performance metrics produce fluctuating values; 

3. The performance changes significantly depending on whether the routine's arguments are in 
cache or main memory. 

All of these aspects are taken into account in the designs of our models in Section 3.2. To avoid 
their influence in the experiments in this section, we take the following measures: 

1. The first measurement is always discarded; 

2. Each experiment is repeated 100 times and the median of this series is presented as a result; 
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3. Successive routine invocations use the same memory locations for their arguments (static 
memory policy). 

We focus on the dependence of performance on the types of arguments encountered in BLAS. 
(An introduction to BLAS is given in Appendix A.) We consider each argument type and its 
influence on performance individually: 

• Discrete 1 arguments, such as uplo or trans (Section 3.1.1), 

• Size arguments m, n, and k (Section 3.1.2), 

• Scalar arguments, for instance alpha or beta (Section 3.1.3), 

• Vector and matrix arguments, such as A, B, or x (Section 3.1.4), and 

• Leading dimension or increment arguments like IdA and incx (Section 3.1.5). 

Throughout the experiments in this section, we take measurements on one core of a Quad-Core 
AMD Opteron Processor 8356 [8] running at 2.30GHz. This processor can issue 2 double precision 
floating point instructions per cycle; therefore, its theoretical peak performance is peak _f tops/ 's = 
2 x 2.30 • 10 9 operations per second. 

While we consider ticks as a representative performance counter in most experiments, the 
gained insights are also applicable for other performance counters, such as flops or Limisses. 

3.1.1 Discrete Arguments 

In order to understand the influence of discrete arguments on performance, we consider dtrsm 
(B <— A~ 1 B 1 A triangular). This routine is an ideal candidate, since it covers all types of discrete 
arguments that appear in BLAS: side, uplo, transA, and diag. For our experiments, we vary 
these arguments and fix the remaining ones: 

side uplo transA diag m n alpha A IdA B ldB 
dtrsm(side , uplo, transA, diag, 256, 256, 0.5 , A, 256, B, 256). 

Measurements of the performance counter ticks for all possible combinations of discrete ar- 
gument values are shown in Figure 3.1. At a first glance, we can already see that every BLAS 
implementation shows a different behavior; we cannot find a consistent pattern across all of them. 
However, we can find several similarities common to at least some of the implementations. 

First of all, diag has only very little influence on performance in any of the implementations. 
Only for MKL 2 (•) and ATLAS (•), we observe that diag = U results in a slightly lower runtime. 
(This is not unexpected, since U indicates that A is unit-triangular, therefore requiring 256 floating 
point operations less.) 

When we consider the discrete arguments side, uplo, and transA, we observe a coupled de- 
pendence. In Figure 3.1, we see that the pattern on the left (side = L) is inverted (mirrored along 
a horizontal line) with respect to the same pattern on the right (side = R). (This is understand- 
able, since side determines the order of operations — A~ X B for L and BA~ X for R.) Additionally, 
ATLAS (•) is on average faster for side = R, while MKL (•) is slower for this case. 

Considering side = L (the left half of Figure 3.1) and looking at GotoBLAS2 (•) and ATLAS (•), 
we find that when uplo = L, transA = N is faster than transA = T, while the opposite is the case 
for uplo = R; for MKL (•), transA = T is faster than transA = N independent of uplo. 

1 Meaning with a finite value range. Arguments that take integer values are referred to as continuous. The usage 
of "discrete" and "continuous" thus differs from their purely mathematical meaning. 

2 Since MKL is designed for Intel architectures, it attains lower performance on this AMD system. 
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Figure 3.1: Dependence of ticks on discrete arguments in dtrsm. 



All these observations have shown that side, uplo, and transA have a significant impact 
on performance and that the dependency cannot be described in a consistent manner across all 
implementations. When we aim at modeling this dependency, our only option is to generate a 
separate model for each of combination of these parameters. 

diag, on the other hand, was shown to only have a minor impact on performance, therefore, 
we are inclined to ignore this argument in order to reduce our model complexity. 



3.1.2 Size Arguments 

With the exception of dgemm (C <— aAB + j3C), all BLAS Level-3 routines have 2 size arguments; 
dgemm has 3: m, n, and k. For this reason and since dgemm is usually the most optimized BLAS rou- 
tine, we use it in the following experiments in order to understand the dependence of performance 
on the size arguments. We consider the routine invocation 

transA transB m n k alpha A IdA B ldB beta C ldC 
dgemm ( N , N , m , n , k , 0.5, A , m , B , fc , . 5 , C , m ) 

and perform three experiments; in each, we vary one of the size arguments m, n, and k (and 
the corresponding leading dimension) between 8 and 512 while the other two are fixed to 256. 
Figure 3.2 shows the performance metrics ticks and efficiency for these experiments. From these 
plots, we can learn two important things: 

• Each of the three size arguments m ( ) , n ( ) , and k ( ) has a different influence on 

the performance of the routines. 

Since the number of mathematical operations for dgemm is rank + 2mn, one might expect 
similar dependencies on m and n and a different one for k. However, the results show that 
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Figure 3.2: Dependence of ticks and efficiency on size arguments in dgemm. 



this is not the case: For GotoBLAS2 ( ), m is the argument which, compared to n and k, 

produces significantly different behavior; the same is true for MKL ( ) and n. 

While in this experiment, we only varied one of the size arguments while the others remained 
fixed, in general, all size arguments may vary independently, leading to nonlinear influences 
on performance. The only possibility to capture this behavior is to represent the performance 
as a function defined on a three dimensional argument space - one dimension for each size 
argument. 

• The performance dependency on size parameters is generally nonlinear. Especially Fig- 
ure 3.2b shows the high degree of non-linearity — in this efficiency graph, a linear function 
would result in a clean hyperbola. 

However, it is possible to identify value ranges which are almost linear. For instance, we 

could approximate ATLAS's ticks ( ) roughly by two linear functions: one ranging from 

to about 200 and a second one for the remaining value range. In a similar fashion, we can 

approximate the performance dependency on single size arguments for GotoBLAS2 ( ) 

and MKL ( ) by several linear functions. 

If we extend this approach to accommodate for all sizes arguments simultaneously, we obtain 
multivariate piecewise polynomials. 

These experiments have analyzed the performance dependence on a fairly large scale. Now, we 
take a closer look at a very fine scale: We once more consider dgemm; this time for m = n = k e 
{32, . . . , 64}. ticks and efficiency for these experiments are given in Figure 3.3. 

These plots, especially for ATLAS ( ), show fine oscillations. While we do not want to 

represent these oscillations, we have to be aware of them during the generation of models. For 
this purpose, we only consider samples in small intervals in order to obtain smooth performance 
dependencies (e.g., intervals of 8 result in , , and in Figure 3.3). 
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Figure 3.3: Dependence of ticks and efficiency on size arguments in dgemm at small scale. 
3.1.3 Scalar Arguments 

In BLAS, scalar arguments multiply matrices and vectors, for instance a and /3 in dgemm (C -s— 
olAB + f3C) . This generally leads to one multiplication (floating point operation) per entry of the 
scaled matrix or vector. There are a few exceptions: Multiplications by —1, 0, and 1 do not require 
multiplications. Especially —1 and 1 are very common in application codes. 

In the following experiment, we once more consider dgemm; this time with the following argu- 
ments: 

transA transB m n k alpha A IdA B ldB beta C ldC 

dgemm( N , N , 256, 256, 256, alpha, A, 256, B, 256, beta, C, 256). 



For both alpha and beta we consider the values 0, 1, —1, and 0.5 (as a representative for the 
general case). The resulting ticks are shown in Figure 3.4. 

Evidently, all implementations take advantage of the special cases where alpha = 0: dgemm re- 
duces to C 4— j3C — scaling a matrix. Compared to matrix-matrix multiplication with complexity 
0(n 3 ), this only requires 0{n 2 ) floating point operations. As a result the execution time drops 
significantly across all implementations. If now, beta = 1, no work has to be done at all and the 
execution time is effectively 0. 

The other values for alpha (1, —1, and 0.5) do not result in different execution times. 

Considering beta, we see that GotoBLAS2 (•) and MKL (•) show a behavior different from 
ATLAS (•): GotoBLAS2 (•) and MKL (•) show a decreased execution time only for alpha = 1, 
while ATLAS (•) apparently only take advantage of alpha = 0. 

We can conclude that some special values of scalar arguments trigger different branches in 
BLAS routines, similar to discrete arguments. However, Since the exceptional case alpha = does 
not appear in any well written application, the overall influence of scalar arguments on performance 
is rather small compared to the impact of other arguments. Therefore, we decide not to represent 
this dependency. 
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Figure 3.4: Dependence of ticks on scalar arguments in dgemm. 



3.1.4 Vector and Matrix Arguments 

As discussed earlier, the entries of matrix and vector arguments have no effect on the computation. 
Therefore, these arguments do not influence any performance metrics. 

3.1.5 Leading Dimension and Increment Arguments 

Leading dimension and increment arguments specify the distance of adjacent entries in a row of a 
matrix or vector. They can potentially influence the access patterns of both the CPU's cache and 
main memory, and thus affect performance. 

To study the actual influence of these arguments, we once more consider dgemm (C aAB+/3C) 
and vary the three leading dimension arguments IdA, ldB, and ldC separately between 128 and 
256. The other arguments are fixed: 

transA transB m n k alpha A IdA B ldB beta 

dgemm( N , N , 128, 128, 128, 0.5, A, IdA, B, ldB , 0.5, C, ldC). 

Measurements of the resulting ticks and Li misses are given in Figure 3.5. 

Again, we can observe different behaviors across our BLAS implementations: GotoBLAS2's 

performance ( ) is independent of the leading dimension arguments, while MKL ( ) and 

ATLAS ( ) show small oscillations for increasing leading dimensions. For MKL ( ), for 

instance, we find that ticks decrease slightly for increasing IdA and spikes again at constant 
intervals of 32; Lunisses increase correspondingly. 
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Figure 3.5: Dependence of ticks and Limisses on leading dimension arguments in dgemm. 



Overall, however, we conclude that the leading dimension arguments only have a minor influ- 
ence on performance compared to other arguments. Modeling this influence would increase the 
complexity of our models tremendously; we usually decide to not represent it. 



3.2 The Targeted Models 

Based on the observations made in the previous sections, we now introduce the type of the perfor- 
mance models we aim to generate. We begin by defining the model structure for the general case in 
Section 3.2.1 and then show how a concrete model of this type may be evaluated in Section 3.2.2. 



3.2.1 Creation and Reasoning 

In order to devise the structure for our performance models, we first recall all significant observa- 
tions that we have to account for: 

(a) Performance varies significantly between different implementations of BLAS (Section 2.2 and 
Appendix A. 4). 

(b) Repeated executions of a single routine with fixed arguments results in fluctuating performance 
(Section 2.2). 

(c) Performance is influenced by the memory locality (e.g., cache or main memory) of the matrix 
(and vector) arguments (Section 2.2). 

(d) Many discrete arguments and special values for scalar arguments trigger separate branches in 
BLAS routines. These branches can potentially lead to a completely different performance 
behavior (Sections 3.1.1 and 3.1.3). 
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(e) Performance generally depends non-linearly on size arguments are generally non-linear; how- 
ever, it can often be represented by multivariate piecewise polynomials (Section 3.1.2). 

(f) Leading dimension arguments only have a minor influence on performance (Section 3.1.5). 

Before we turn to the construction of models, we need to clarify what we understand under the 
term "performance model": 

A performance model for a DLA routine is a function that, given values for 
the routine's arguments, provides estimates for a set of performance metrics 
for its execution in a certain environment. 

This definition separates the observations enumerated above into two categories: 

• Implementation and memory locality ((a) and (c)) depend on the execution environment. 
Therefore, we will create different models for different BLAS implementations and memory 
locality situations. 

• Fluctuations and argument values ((b) and (d) through (f)) are covered by a single perfor- 
mance model. 

To build the structure our models, we begin by considering the routine arguments. Since some 
of these only have very little influence on performance, we only account for a subset of arguments 
in our models, called the model parameters. 

The aspects that can most significantly influence the performance behavior are discrete ar- 
guments and special values for scalar arguments (d): they can trigger the execution of different 
code branches. To reduce the model complexity and the effort spent on generating it, we select a 
subset of these — the discrete parameters. The argument type determines if it is modeled; diag 
and scalar arguments are usually omitted. In our models, each combination of discrete parameter 
values — referred to as a (discrete) case — is treated separately. Example: If we consider three 
discrete parameters with two possible values each, we have 2 3 = 8 cases; each of them is modeled 
independently. 

For every case, we introduce a separate (sub-)model for each performance metric. The remaining 
aspects (e), (f), and (e) are treated for each case and each performance metric separately. 

We now turn to the dependence of performance on size and leading dimension arguments ((e) 
and (f)), which can both take non-negative integer values. In our models we offer to represent 
both argument types. However, we have observed that leading dimension arguments ((f)) only 
have a minor influence on performance. Therefore, we once more select a subset of size and leading 
dimension parameters (usually only the former) called continuous parameters . 

In principle, continuous parameters can take arbitrarily large values. We, however, limit our 
models to a certain range — for example, values between 8 and 1024. Additionally, in order to avoid 
the small scale performance oscillations (Section 3.1.2), we restrict the allowed parameter values 
to multiples of m in gap: a small step size, such as 4, 8 or 16. The collection of the sets of allowed 
values for all continuous parameters spans a product space, referred to as (continuous) parameter 
space. Example: Two continuous parameters within {8, 16, . . . , 1024} span the parameter space 
{8, 16, ...,1024} 2 . 

As suggested in Section 3.1.2, our choice to represent the dependencies on continuous parameters 
are multivariate piecewise polynomials. In ID, these are several intervals of the parameter axis each 
associated with a polynomial. In higher dimension, however, piecewise polynomials could in general 
consist of arbitrary regions of the parameter space associated with multivariate polynomials. We 



3 Continuous not in the mathematical sense but in contrast to discrete parameters, which can only take two 
values. 
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restrict the type of regions to (hyper-)cuboids with faces parallel to the parameter axes (intervals 
in ID, rectangles in 2D, and cuboids in 3D). These can be represented by two points, determining 
lower and upper limit of the region along each coordinate direction. 

It remains to incorporate aspect (b) : The fluctuations of performance counters for fixed argu- 
ment values. These fluctuations can only be represented in a statistical or probabilistic fashion. In 
principle, we would like to create a probability distribution for each performance counter at every 
point. Since this is not possible in practice, we limit our models to a vector of statistical quanti- 
ties: minimum, average, median, standard deviation, maximum, and possibly others. Therefore, 
we associate each polynomial region with a vector-valued polynomial, providing these quantities. 
As a result, the model for one discrete case is a vector-valued multivariate piecewise polynomial. 

3.2.2 Evaluation of a Model at a Point 

We now show how a model of the previously introduced type is evaluated given values for the 
modeled routine's arguments. We take dtrsm (B <— A~ 1 B, A triangular) as an example: 

dtrsm(side, uplo, transA, diag, m, n, alpha, A, IdA, B, IdB). 

We consider a performance model with the following properties: 

• side, uplo, and transA are its discrete parameters, resulting in 8 cases; 

• m and n are the continuous parameters, both taken from the range {8, . . . , 1024}; 

• diag, alpha, IdA, and IdB are neglected; 

• It provides estimates for ticks, flops, and Limisses; 

• It represents the performance through the statistical quantities minimum, average, standard 
deviation, and maximum. 

Figure 3.6 visualizes the evaluation of this model for the exemplary argument values 

side uplo transA diag m n alpha IdA IdB 

dtrsm( L , U , N , N , 256, 512, 0.7 , A, 512, B, 512). 

Given this tuple of arguments (L, U, N, N, 256, 512, A, 512, B, 512), the first step in the model 
evaluation is to extract the model parameters. In our case, these are: 

(side, uplo, transA, m,n) = (L, U, N, 256, 512). 

These are separates into discrete and continuous parameters: 

• The discrete parameters (size, uplo, transA) = (L,U, N) determine the case; 

• The continuous parameters (m, n) = (256,512) identify a point in the two dimensional con- 
tinuous parameter space. 

For the case (L,U, N), the model has one vector valued multivariate piecewise polynomial 
P : {8, . . . , 1024} — > N 4 for each performance counter; the result vector contains the statistical 
quantities. P is represented by a set of rectangular regions that cover the two dimensional pa- 
rameter space, each associated with one vector valued piecewise polynomial. The point (256, 512) 
lies within at least one of these regions. For the case of overlapping regions, the model accuracy 
of each of them is stored along with its polynomial; the region with the most accurate model is 
selected. 
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input: (L, U, N, N, 256, 512, 0.7, A, 512, B, 512) 
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Figure 3.6: Structure and categorization of performance models and their evaluation. 



For the identified region, the associated vector valued polynomial is evaluated at the point 
(256,512). The resulting vector contains the quantities minimum, average, standard deviation, 
and maximum for one performance counter. 

The process — identifying the region and evaluating its polynomial — is applied to all perfor- 
mance counters: ticks, flops, and Li misses. The result of the model evaluation is a vector with 
the statistic quantities for each of them. 



3.3 The Modeler 

In the previous section, we have described the structure of our targeted performance models. We 
now introduce the tool that automatically generates these models: the Modeler. 
It is implemented in Python mainly for the following reasons: 

• Python has extensive high-level built-in functionality for lists and dictionaries, including 
functional programming constructs; 

• It is completely object oriented; 

• It can easily interface with other programs (in our case the Sampler); 

• Its scientific library SciPy provides mathematical tools, such as least-squares solvers. 

The Modeler uses an iterative approach, starting with an initial set of samples and a very 
rough or small models. Depending on the accuracy, more samples are taken to expand or refine 
the models. The main program flow is as follows. 

1. Initialization: 

• Read a configuration file. 
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• Initialize one instance of the Sampler. 

• For every routine that needs to be modeled, create a separate Routine Modeler (RMod- 
eler). 

2. While not all RModelers have completed modeling their associated routines' performance do 
the following: 

(a) Retrieve a list of sampling requests from each of the RModelers; 

(b) Execute the sampling requests by the Sampler; 

(c) Let the RModelers refine their models using the sampling results. 

3. Finalization: Retrieve the generated models form the RModelers and write them to a file. 

The Modeler is limited to one Sampler; models for different BLAS implementations and memory 
locality situations can be obtained through the Modeler configuration file. 
The Modeler consists of several, separately discussed components: 

• The Sampler Interface (Section 3.3.1); 

• The RModelers (Section 3.3.2); 

• The Piecewise Polynomial Modelers (PModelers) as part of the RModelers (Section 3.3.3). 
3.3.1 Sampler Interface 

The Sampler Interface (SI) serves as a Python wrapper for the Sampler, handling and passing on 
sampling requests and returning the resulting measurements. It keeps a list of all measurements 
in a memory file; when the Modeler is executed repeatedly with the same Sampler configuration, 
measurements from this file are used to reduce the time spent on sampling. 

Upon initialization, the SI loads the memory file corresponding to its Sampler's configuration; 
if no such file exists, a new memory file is created. Then, the Sampler process is started and OS 
pipes are used to exchange data. 

The RModelers pass sampling requests to the SI in the form of tuples, for instance 

(dtrsm, L, U, N, N, 256, 512, v.5, 131072, 512, 262144, 512). 

All incoming requests are collected and processed together, once the Modeler instructs the SI to 
do so. 

When this happens, the SI first scans its memory file for results matching the collected requests; 
the matches are added to an initially empty result list. Each entry in the memory file is served 
at most once per execution of the Modeler; Identical sampling requests receive different sampling 
results. 

Those remaining requests that cannot be served from the memory file are passed to the Sampler. 
Every resulting measurement is both stored in the memory file and added to the result list. Once 
all sampling requests are covered by this list, each RModeler — identified by its routine's name — 
is passed exactly those results that correspond to its routine. 

Upon termination, the Sampler Interface stores its memory file. 
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3.3.2 Routine Modeler 

A Routine Modeler (RModeler) creates a performance model for a single routine. Its main task is 
to provide a layer of abstraction between two components: the Sampler Interface and the Piecewise 
Polynomial Modelers (PModelers), whose job is to create models of the form 

f ' J^^ con *^ nuous parameters ^ ^^statistical quantities 

The abstraction, as seen from the Modeler and the Sampler Interface, consists of the following 
stages: 

Stage 1 Selection of parameters from the routine's argument list; 
Stage 2 Separation of discrete and continuous parameters; 
Stage 3 Treatment of the discrete cases; 

Stage 4 Handling of PModelers for each case and performance counter. 

The RModeler provides three major functions, each of which covers the four stages of abstrac- 
tion: 

• Generating of sampling requests for the Sampler Interface (Section 3.3.2.1); 

• Passing the obtained sampling result to the PModelers in order to improve their models 
(Section 3.3.2.2); 

• Constructing an exporting the final model (Section 3.3.2.3). 
3.3.2.1 Generation of Sampling Requests 

The PModelers generate sampling requests in order to improve their models. These requests are 
processed by the RModeler according to the four stages of abstraction and then passed to the 
Sampler Interface. 

Each PModeler constructs a piecewise polynomial model / for one performance counter pi for 
one of the discrete cases Cj of the form 

f ' f^^ con * muous parameters ^ ^^statistical quantities 

In every iteration of the Modeler's main loop, it generates a list of requests lj; the requests are 
points within the continuous parameter space, for instance (256,512). The RModeler needs to 
transform these tuples into a complete sampling requests, such as 

(dtrsm, L, U, N, N, 256, 512, v.5, 131072, 512, 262144, 512). 

This is done as follows: 

Stage 4 The RModeler begins by merging the request lists Z] of all PModelers for the same discrete 
case Cj into one list lj. It is not necessary to keep track of which PModeler issued which 
requests, since all performance counters are measured for each sampling request. Every 
PModeler is afterwards given as many results to construct its model from as possible — even 
those which it did not request. 

When lj is created, all duplicates across different PModelers are removed: When two or more 
PModelers request samples at the same point, the maximum number of samples at this point 
appears in lj — not the sum of all requests. 
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Next, lj is compared with the list of sampling results for the case Cj that the RModeler 
has already received for previous sampling requests; it is possible that a PModeler requests 
samples at points that were already sampled for other PModelers. Every point in lj that is 
found in the sampling results is removed from the list as many times as measurements are 
available for it already. This ensures that each PModeler receives at least as many samples at 
each point as specified. (It also means that when a PModeler wants to increase the number 
of samples at a point it previously requested, it needs to specify the total number of samples 
it requires — not the difference.) 

The output of this stage of abstraction is a list of sampling points lj for each discrete case 

Cj. 

Stage 3 Next, the RModeler creates a list that contains all lists lj, each associated with its case 
cj- {(h,ci), (h,c 2 ), . . .). 

Stage 2 Then, the discrete parameters for each case are incorporated into the sampling requests, 
resulting in a single list of requests I. Requests in this list are of the form (ci,p), for instance 
(L,U, N, 256, 512) when three discrete parameters constitute each case. 

Stage 1 In the last step, the RModeler needs to add the routine name and values for the missing 
routine arguments to each request in I, for example, 

(dtrsm, L, U, N, N, 256, 512, v.5, 131072, 512, 262144, 512). 

An argument that does not correspond to a parameter is assigned a value according to its 
type: 

Discrete arguments are given a default value. 

Size arguments are also given a default value. However, usually all size arguments are model 
parameters. 

Scalar arguments receive a default value, such as v.5. (It is also possible to configure these 
as discrete parameters, for example, with values —1, 0, 1, and .5.) 

Leading dimension arguments are handled in one of two ways: 

• They are given a large default value, such as 2500. This represents the situation 
where the corresponding matrices are submatrices of a larger matrix. 

• They are set exactly to the height of the corresponding matrix, which is given 
by one of the size arguments. In order to determine the correct size argument 
the dependency between size and leading dimension arguments (which may in turn 
depend on the discrete arguments, such as transA or side) is encoded in the Python 
representation of the routine signature. 

Increment arguments are set to either 1 or a larger default value to represent the access of 
a row of a matrix. 

Matrix (or vector) arguments are set to the number of element that their matrices (or 
vectors) occupy in memory. This number is the product of the leading dimension (vector 
length) argument and the width of the matrix (1). Which size argument is involved may 
again depend on the discrete arguments and is determined by the Python representation 
of the routines signature. 

The result of these four steps is a list of sampling requests that is passed to the Sampler Interface. 
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3.3.2.2 Processing Sampling Results 

When an RModeler receives results from the Sampler Interface, each of them consist of the sampling 
request and one measurement for each performance counter. 

Before the RModeler can present the results to the PModelers, they need to be processed 
according to the four stages of abstraction: 

Stage 1 The RModeler begins by extracting the parameters from the sampling requests, which 
consists of values for all routine arguments. 

Stage 2 These are then split into discrete and continuous parameters. 

Stage 3 The results are added to the list of all results for the corresponding case, which is deter- 
mined by the discrete parameters. 

Stage 4 Each PModeler is then given a list of results corresponding to its case, containing only 
the measurements for its performance counter. The RModelers immediately use the results 
to improve or advance their models. 

3.3.2.3 Assembly of the Model 

When all models are complete, each RModeler retrieves the piecewise polynomials from its PMod- 
elers and assembles a model object. This independent object is then written to a file and later 
used independently of the Modeler. 

The structure of this model was introduced in Section 3.2 and resembles the four stages of 
abstraction. This becomes apparent when we consider its evaluation for given routine arguments: 

Stage 1 The model parameters are extracted 

Stage 2 They are separated into discrete and continuous parameters; 
Stage 3 The discrete parameters determine a case; 

Stage 4 For this case, all piecewise polynomials are evaluated at the continuous parameters. 
3.3.3 Piecewise Polynomial Modelers 

In Section 3.1.2, we saw that performance depends on the size arguments of BLAS routines in 
the form of piecewise polynomials. Constructing such polynomials by hand is rather simple given 
a large number of samples. By contrast, the Piecewise Polynomial Modelers (PModelers) build 
models that closely fit the performance automatically and with as few samples as possible. 

The process for modeling performance involves three steps. First, a set of sampling points 
is chosen for an initial region within the parameter space. The measurements resulting from 
these sampling points is modeled through least squares fitting as a polynomial. According to the 
polynomial's accuracy, the region is reshaped, more regions are generated or the model is accepted. 
These steps are repeated until the whole parameter space is covered with accurate models. 

We offer two PModeler implementations, which differ in both their choice of sampling points 
and the strategy they use to cover the parameter space with regions. Before we present the 
implementations in Sections 3.3.4 and 3.3.5, we discuss their common basis. 
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3.3.3.1 Polynomial Fitting through Least Squares 

The approximation of a set of sampling results by polynomials through least squares fitting is a 
task common to all Piecewise Polynomial Modelers. Every performance counter is represented by 
a set of statistical quantities, such as minimum or average; each quantity is fitted by a separate 
polynomial. In this section, we introduce polynomial fitting for a single quantity. 
The fitting procedure has the following inputs: 

• A list of coordinates {x l5 . . . , x„} £ N d from the d dimensional parameter space; 



A value for each coordinate v = (vi, . . . , v n ) 



T. 



A set of monomials 4 {mi, . . . , mi,} C {/ : N d — > N} as a base for the polynomials. 

b 

Its goal is to find a polynomial -P(x) = a^m^x) with coefficient vector a = (ai, . . . , ab) T £ 



such that (-P( x i) — v i) 2 is minimal. This is know as a least squares problem and can be written 
as argmin||Xa — v|| 2 , where 



X 



l'm 1 (x 1 ) m 2 (xx) ••• m^Xi)^ 
toi(x 2 ) m 2 (x 2 ) ••• mf,(x 2 ) 



\to!(x„) m 2 (x„) ••■ m 6 (x n )/ 



Its solution can be obtained as a = (X T X)~ 1 X T v or by the means of singular value decompo- 
sition (SVD). We use the function linalg. IstsqO provided by Python's SciPy package, which is 
based on SVD. 

For small values of x^ and v,-, this least squares solver yields very accurate results. If, however, 
the Xi and Vi are concentrated in a small area far away from the origin, the solution accuracy 
decreases significantly. This is not a problem of the solution method but rather due to the condi- 
tioning of the least squares problem. 

In order to improve the solution accuracy, we need to improve the conditioning of X. For this 
purpose, we translate the problem to the origin, solve the translated problem, and translate the 
solution back to the original problem (see Figure 3.7). We begin by translating the coordinates x^ 
and values vi (») to x^ and Vi (») such that they are evenly distributed around the origin: 



^ n I n 

Xj = Xj X,- and v { = Vi } y ' 



, , 3 ' 

n £ . — ' n * — ' 

3=1 j=i 

, , — 1 1 2 

These quantities form a new least squares problem argmin Xa — v ; its solution defines a poly- 

a 

„ b 

nomial P(x) = Oimj(x) ( ). The solution P ( ) to the original problem is now obtained 

j=i 

through a second translation: 



n 1 n 

P(x) = P(x-I S x j ) + i S 



3 " 

J I " 3=1 



4 A polynomial with only one non-zero term, such as 1, x\, or x\x\. 
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Figure 3.7: Translation of the least squares problem. 



In addition to this coordinate translation, our least squares fitting mechanism rounds coeffi- 
cient to close rational numbers or discards relatively small coefficient according to a configurable 
threshold. This is especially useful for performance counters such as flops, where all coefficients 
are usually rational numbers with small denominators. 

3.3.3.2 Approximation Accuracy 

The accuracy of a polynomial approximation P for coordinates Xj and values Vi is determined by 



the local errors ej = P(xi) — Uj. The used least squares approach minimizes ^ ef ; this quantity 

<=i 

could be used to measure the approximation accuracy. We, however, use the maximum relative 
error across all Xf. 



For a vector valued polynomial representing the statistical quantities, we select the approxima- 
tion error in one of the statistical quantities to represent the approximation's accuracy; usually, 
we choose the median. 

3.3.4 PModeler: Model Expansion 

In this section, we introduce the first of two Piecewise Polynomial Modelers (PModelers) presented 
in this thesis. The Model Expansion strategy creates piecewise polynomials as follows: 

• It begins by modeling a small region in a corner of the parameter space through a single 
polynomial (Section 3.3.4.1); 

• This region is expanded as far as possible (Section 3.3.4.2), as long as 

— its polynomial's approximation accuracy is above a given threshold, and 

— it stays within the boundaries of the parameter space; 



n 
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• Once a region cannot be extended further, new adjacent regions are generated (Section 3.3.4.3). 

The process of modeling, expansion, and region generation is repeated for all regions until the 
whole parameter space is covered. 

Models are expanded either from the origin of the parameter space towards its maximum or 
in the opposite direction. The initial domain is accordingly chosen at the minimum (bottom left) 
or maximum (top right) corner of the parameter space. Without loss of generality, we introduce 
Model Expansion by considering the expansion away from the origin. 

3.3.4.1 Initial Model 

The size of the initial region is chosen with both the resulting model and the expense of sampling 
in mind: Smaller regions lead to a higher resolution of the model, while larger — thus, fewer — 
regions require less sampling. We found a length of 64 or 128 along each parameter direction to 
be a good compromise. 

Within the initial region, the sampling points are chosen on a regular grid. For a targeted 
polynomial order o, this grid has least o + 1 points along each parameter direction; for a d- 
dimensional parameters-space, the grid consists of (o+l) d points. At the expense of more sampling 
requests, the grid density can be increased to generate smoother models. 

Once a first model is created, its approximation error is computed. If this error is below the 
configurable threshold, the region is expanded (Section 3.3.4.2); otherwise, new adjacent regions 
are created immediately (Section 3.3.4.3). 

3.3.4.2 Region Expansion 

While the approximation error of a region is below the target threshold, the region is expanded as 
far as possible, as long as 

• The parameter space boundary is not reached, and 

• The approximation error stays above the threshold. 

Expanding away from the origin, initially, only the lower limit b = (pi,... ,bd) (its base) of the 
region's extent is fixed. An upper limit, denoted by c = (ci, . . . , c^), has yet to be determined. 
However, we can already give both a lower and an upper bound for c: The lower bound 1 = 
(li, . . . , is the upper limit of the initial region's extent; the upper bound u = (ui, . . . , Ud) is the 
maximum corner of the parameter space. During the process of model expansion, the bounds 1 
and u are increased and decreased, respectively, until they converge to one point c. 

We denote the region of the parameter space spanned by two points a = (ax,...,a<j) and 
b = (bi, ...,b d )by 

Rl = {(xi, ...,x d )e N d \Vi G {1, . . . , d}. ai < Xi < h}. 

At this point either the end of the parameter is reached or the approximation error is just below 
the threshold. 

In each step of the model expansion, a set of sampling points P C R£ is chosen. For this 
purpose, a parameter value pi is selected along each parameter direction i by means of the first of 
the following rules that is applicable: 

(a) When Ui ~ li > maxgap, select Pi = it$ + maxgap; 

(b) In the first step, when ui —l%> maxgap, select = Ui (this reduces the number of steps in 
case that Cj = Ui); 
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Figure 3.8: Choice of sampling points for Model Expansion. 



(c) When U + mingap > u,, select £>j = itjj 

(d) Otherwise, select pi = [ li \ Ui \ mlngap t resulting in a binary search pattern 5 

This results in one sampling value along each direction i. 

Next, a set of sampling points P is generated from the sampling values pi 



P 



d \ / d 

X{bi,k,Pi}) \ ( X{bi,h} 



Example. Let us study an example illustrating the construction of P (Figure 3.8a). We consider 
a two-dimensional parameter space (d = 2) with an initial region (H) in the lower left corner. This 
region determines the base b = (61,62) and the initial lower approximation bound 1 = (Zi, Z2) for 
c = (ci, C2); the upper bound u = (ui, U2) is given by the upper limit of the parameter space. 
The parameter values pi and pi are determined as follows (see Figure 3.8a): 

• For pi, rule (a) applies, since > maxgap, resulting in pi = h + maxgap. 



• For P2, none of the first three rules apply, since (b) " 2 2 1-1 < maxgap, (c) U2 — I2 > maxgap, 
and (c) U2 — fa > mingap. Therefore, P2 = [ ' 2 ^" 2 j mingap 1S chosen according to rule (d). 

Given pi and p 2 , P (•) is constructed: 

P = {{bxM,Vi} x {&2, fa,P2}) \ ({bi,fa} x {b 2 , fa}) 

= {(61,62), (bi,fa), (bi,p 2 ), (pi,6 2 ), (pi, fa), (pi,P2), (h,b 2 ), {fa, fa), (luPn)} 

\{(bi,b 2 ),(b 1 ,fa),(h,b2),(h,fa)} 
= {(h,P2), (Pi,b 2 ), (Pi,fa), (pi,Pa), (k,P2)}- 
The set P forms a hull around the initial region (□) with points on all intersections of hi, li, and 

Pi- □ 



\x\ y := max{i • y\i E N, i ■ y < x} denotes the largest multiple of y that is smaller or equal to x. 
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Once the sampling results for the points P are available, the expansion of the model along each 
direction i is considered separately. For direction i, the region of the model is tentatively expanded 
only along this direction up to and including pi , resulting in a region 

all available sampling results within this region — in particular, including those at pi — a poly- 
nomial model is created through least squares fitting. Depending on how accurately the obtained 
polynomial approximates the sampling results, the lower or the upper bound Zj or Ui is updated: 

• If the approximation error is below the threshold, region is expanded up to and including pi 
by setting k «- 

• Otherwise, we know that c, ; must be below pf. therefore, it, is set mingap away from p i: since 
this is the largest value that can still attain: m 4— Pi — mingap. 

When the updated bounds Zj and Ui coincide, the maximum extent of the region along direction 
i is reached: Cj = Zj = Uj. The process of sampling and modifying 1 and u is repeated until all 
values of c are determined in the same way. Once this is the case, new regions are generated based 
on the current region's extent (Section 3.3.4.3). 

Example. Returning to the previous in Figure 3.8a, we now consider the expansion of the region 
given the new sampling results. First, the region is tentatively expanded along parameter direction 
1. For this purpose, the region R^ 1,12 ^ is considered; it reaches from b\ to p\ and b 2 to l 2 in param- 
eter directions 1 and 2, respectively. For this region, both the samples used for the construction 
of the initial model and the new samples at (p 1; 61) and (pi,l 2 ) are available. From all of these 
samples, a new polynomial is generated through least squares fitting. Let's say that this model's 
approximation error is below the threshold. Consequently, l\ is updated to: Zi <— p\. 

The updated value for Zi is now used for the tentative expansion along the second parameter 
direction up to p 2 . This region eL 1 '* 3 ' contains both the initial model's sampling points and as 
all new sampling points (»). Let's say that the polynomial model created from all these points has 
an error that exceeds the threshold. As a consequence, the region's expansion along parameter 
direction 2 cannot include p 2 . Hence, u 2 is set to the largest possible value that the region's extent 
can attain: u 2 p 2 — mingap. The resulting situation is shown in Figure 3.8b. 

Since both Zi < u\ and Z 2 < u 2 , another step of the model expansion process follows; this time 
the sampling points shown in Figure 3.8b are chosen. In this next step, the range of possible values 
for c (spanned by 1 and u)is significantly smaller than before. 

After only a few more steps, 1 and u will converge, yielding c. □ 

3.3.4.3 Region Generation 

Once the extent of a region has been maximized, new neighboring regions are generated. The 
principle behind this generation is shown in Figure 3.9: Along each expansion direction, one new 
region is generated right next to the maximized region. Unfortunately, it is not that simple. If we 
consider the two shaded regions in Figure 3.9 to be the result of the expansion process, we find that 
they overlap. In this case, it is undesirable to generate new regions right next to these, because 
they would lead to more overlapping regions and thus redundant modeling. A suitable solution 
for this particular case would be to generate a new region in the top right corner, bordering both 
regions. In this section, we introduce a procedure that generalizes this approach. 

Once more considering the expansion away from the origin, the inputs of this procedure are 
the following: 

• A region R^, that was extended as far as possible, 

• The set of all other regions 1Z, and 
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Figure 3.9: Basic concept of region generation in Model Expansion. 



• The limits of the parameter space in the expansion directions. 

From these inputs, the procedure generates a set of new regions. Each of these regions is identified 
by its base point b — the lower limit of the new region's initial extent. We now describe how the 
set S containing these base points is created. 

First, an initial set of points S is chosen; this set is then iteratively refined. The initial set 
consists of one point p = (&*, . . . , b*_ 1 , c* + mingap, b* +1 , . . . , b*j) for each parameter direction i. 
(» in the situation in Figure 3.9). Then the procedure iteratively updates S by considering the 
regions R £ 1Z until S stays the same for all regions. 

A given region R can be of two types: 

• R is in the process of expansion, or 

• R has been expanded as far as possible. 

In the first case i?'s upper limit c is not fixed yet: only a lower bound 1 and an upper bound 
u are known, such that c £ i?" (see Section 3.3.4.2). With R's base b and its upper bound u, 
its maximum possible extent is Every point p £ S that lies within this maximum extent is 
remove from S: 

Si — 5* \ . 

Since the extent of R is not fixed yet, p cannot be selected right next to it; such a base point is 
created, once the expansion of R is complete. 

In the case of a region R — i?£ £ 1Z that has been expanded as far as possible, its upper limit 
c is known. The points S D which lie within this region are replaced by a new set of points S'. 
S is then updated by 

S<- (SDR^US'. 

For this purpose, each point p £ S is associated with a direction pi, for the points in the initial 
set S these are the directions i, along which the base of was shifted to generate them. S' is 
created from the points p £ S which lie within For each such point p, one new point is created 
for each coordinate direction j ^ i p by shifting p up to Cj + maxgap along this direction. 5" is, 
hence, given by 

S' = |J {(pi,-. • ,Pj-i,Cj + mingap, p j+1 , . . . ,p d ) \ j £ {1, . . . , d} \ {i p }} . 
p&SnRg 
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Figure 3.10: Generation of new regions' base points 5*. 



Each point S' is associated with the directions i p of the point p from which it was generated. 

The update of S through S' is the reason why all regions 1Z repeatedly until S remains un- 
changed: The new points within S' can lie within other regions, which were already processed. 
Through the repeated processing, it is guaranteed that none of the suggested new base points S lies 
within any other region. The termination of the process is assured since we have a finite number 
of regions 1Z and the new points 5' are always chosen further along the expansion direction as the 
original point p. 

Example. Let us consider examples of the treatment of points in S in different situations regarding 
the overlap of other regions, as shown in Figure 3.10. In each example, we have a completed regions 
R* (□), an overlapping region R (□) one a point p (»), which was generated by shifting the base 
of the completed region R* (□) along direction i p — 1 to the right. 

• Figure 3.10a: In this scenario, the region R (□) is still within the process of expansion. 
Creating a new region at p would lead to a potentially large overlap with R. However, since 
the final extent of R is unknown, p cannot be shifted to another position (e.g., •) at the 
side of R. Therefore, p is discarded without a replacement. Once R is expanded as far as 
possible, a new region will be generated somewhere between the indicated points (•). 
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• In Figure 3.10b p (») lies within a region R (□) with a fixed limit c. Therefore, p is shifted 
along the side of R* (□) to a point p' (•), which is mingap away from R. With p = (pi,p 2 ) 
and c = (ci, C2), we obtain p' = (pi, c 2 + mingap). 

• Figure 3.10c shows a 3D version of the scenario from the previous example. From p = 
(pi,P2,P3) (•) and c = (ci,c 2 ,c 3 ), two new points p' = (p\,c 2 + mingap, p 3 ) and p" = 
(pi,P2, C3 + mingap) are generated. □ 

Once S is not changed anymore by any of the regions R £ 1Z, all points that exceed the size 
of the parameter space are removed from it. The resulting set is gives the base points for the new 
regions. 

3.3.5 PModeler: Adaptive Refinement 

Performance depends on continuous parameters in a spatially nonuniform fashion: In some regions 
of the parameters space — usually for large parameter values — this dependency is rather regular 
and smooth, while in others it can be very irregular, containing jumps, kinks, and curvature 
changes. In order to represent such structures to different degrees of detail, we now introduce an 
approach based on adaptive refinement. Similar approaches are commonly used in a variety of 
disciplines such as mesh generation and optimization. In our context, the idea is to begin with 
a simple and regular model constructed from a coarse grid of samples; then, the model's quality 
and accuracy are evaluated and it is refined in the insufficiently approximated regions by locally 
increasing the grid resolution. These steps are applied recursively to the refined regions, until 
either the accuracy suffices across the whole domain or a given resolution limit is reached. 

In the beginning, a single region is created spanning the whole parameter space. This region 
is then sampled at points on a regular grid; for polynomial approximations of order o, this grid 
consists of at least 0+ 1 points along each direction. From the resulting samples, a first polynomial 
approximation is computed through least squares fitting. If the approximation error of this poly- 
nomial is below the error bound, the model is already completed. However, it is very unlikely that 
one polynomial is accurate enough to represent the performance of the entire parameter space. 

When the accuracy of a region is insufficient, it is divided into four roughly equally sized 6 
quadrants in the 2D case (2 d hyper-cuboids in d dimensions). Those quadrants, whose size along 
any coordinate direction is smaller than a configurable minimum size are discarded; the others 
form new regions. The new regions are then sampled on a regular grid with the same number 
of points as for the initial region. New polynomials are fitted to the regions samples and their 
approximation error is computed. When this error is above the threshold, the regions are further 
refined recursively; otherwise, the model for the considered region is complete. 

In each step of the refinement, all sampling points of the coarse model are incorporated in the 
model of the refined regions. In fact, the sampling grid on the refined regions covers all previous 
sampling points within the same region (see Figure 3.11 for the 2D case); sampling for these points 
is not repeated. 

Example. An example of Adaptive Refinement for the 2D case is shown in Figure 3.12. The poly- 
nomial approximation for the initial region spanning the entire parameter space is very inaccurate 
(■, 1 st square). Therefore, it is refined, generating four new regions; new samples are generated 
for these, leading to new polynomial approximations (2 nd square). Here, the error in the top right 
quadrant (■) is already below the threshold (■); The other quadrants are not accurate enough 
and are further refined (3 rd square). Now, several regions are below the error threshold; the others 



6 The splitting point's coordinates are a multiple of mingap in order to avoid the small scale oscillations of 
performance counters. 
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Figure 3.11: Sampling points distributed on a rectangular grid. 
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Figure 3.12: Adaptive Refinement example. 

are refined once more (4 th square). Although some of the resulting regions are still of insufficient 
accuracy, they are not further refined since we do not wish to generate any smaller models. □ 



3.4 Results 

In the previous sections, we introduced the type of performance models we are interested in and 
the Modeler — a tool that automatically generates these models. We now study how the Modeler 
is used and configured to generate the models. 
We consider dtrsm (B ■<— A~ 1 B. A triangular): 

dtrsm(side, uplo, transA, diag, side, m, n, alpha, A, IdA, B, ldB). 

This BLAS Level-3 routine has four discrete arguments (side through diag), two size arguments 
(m and n) , one scalar argument (alpha) and operates on two matrices (A and B with corresponding 
leading dimensions IdA and ldB). From these arguments, we select 

• the discrete parameters side, uplo, and trans A, and 

• the continuous parameters m and n. 

The size parameters are modeled for values between 8 and 1024 with mingap = 8, that is, m, n E 
{8, 16, 24, . . . , 1024}. The remaining arguments diag, alpha, IdA, and ldB are not modeled. They 
receive the following values: 



• diag = N; 
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• alpha = 0.5; 

• IdA = ldB = 2500 (representing the access of submatrices A and B). 

We model the performance counters flops and ticks. 

All samples for the modeling process are taken on one core of a Quad-Core AMD Opteron 
Processor 8356 [8] running at 2.30GHz. We use GotoBLAS2 [3, 18] and the static memory policy 
for high memory locality (in-cache). 

Although both performance counters are modeled simultaneously, we discuss the resulting mod- 
els separately in the following sections. 

3.4.1 flops 

For the performance counter flops (number of floating point operations performed by a routine) we 
can expect a constant value across routine executions with the same arguments, since the dtrsm's 
instructions are data independent. Therefore, each sampling point is sampled only once; instead 
of a range of statistical quantities, we only need to consider one value at each point. 

3.4.1.1 Model Expansion. 

The models generated by Model Expansion are shown in Figure 3.13. The discrete parameters 
uplo and transA show no influence on flops; therefore we only consider the cases side = L and 
side = R. For both, we show the generated regions and their polynomials. The approximation 
error is 0% across the whole domain — our model represents flops exactly. 

Considering the distribution of regions, in the case of side = L (Figures 3.13a and 3.13b), we 
have regions in intervals of 224 along the dimension m. Each of them is associated with a polynomial 
model of the form 0.5mmn + am + /3n. Starting at 1.5, a increases by 1 in every interval, while 

n 

f3 can be expressed as 224« in the n-th region. The observed regions are due to GotoBLAS2's 
i=i 

blocked matrix access. 

Considering side = R (Figures 3.13c and 3.13d), we find the same structure as in the previous 
case along the n-axis. This is the case, since side change the order of the operands from B <— 
Q.5A~ 1 B to B f- 0.5BA" 1 , thus changing the blocked data access pattern. 

3.4.1.2 Adaptive Refinement 

For side = L, Adaptive Refinement covers the parameter space with the regions shown in Fig- 
ure 3.14. As for Model Expansion, these regions are arranged in a clear pattern, representing 
ticks perfectly. However, there are many rather small regions. This is the due to the strict 2x2 
subdivision pattern that is used; a comparison with Model Expansion Figure 3.13a shows that the 
small regions are required to exactly fill up the regions caused by GotoBLAS2's blocked matrix 
access. 

Since we do not gain any accuracy through the small regions, we conclude that Model Expansion 
is more suitable to model flops. The accuracy of moth modeling strategies is perfect, but Adaptive 
Refinement requires a lot more samples than Model Expansion and does not represent the structures 
found in ticks in a clean way. 

3.4.2 ticks 

Turning to ticks, we are faced with a completely different performance behavior, ticks shows 
both a not clearly polynomial dependence on the continuous parameters and fluctuations when 
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Figure 3.13: flops model for dtrsm with Model Expansion. 



one routine execution is sampled repeatedly. To accommodate for the latter, we make use of the 
statistical facilities of the Modeler: 10 samples are taken at each sampling point used to compute 
the statistical minimum, average, standard deviation, and median. We focus our analysis on the 
median, which we consider the most representative statistical quantity. 

We focus on the discrete case (side, uplo, transA) = (L,L, N). While the performance depen- 
dencies of the other cases are slightly different, they show the same structures and general behavior. 
Therefore, comparing different cases does not yield any insights regarding the modeling quality. 



3.4.2.1 Model Expansion 

Model Expansion has a few interesting configuration options for ticks 

• The relative error bound, 

• The direction of expansion (away from the origin or towards it) , and 



• The initial size of regions. 
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Figure 3.14: flops model for dtrsm with Adaptive Refinement — Regions for side = L. 

We address these options' influence by considering a series of models generated with different 
configurations. The plots in Figure 3.15 show how these models cover the parameter space with 
regions along with their relative errors. 

In Figure 3.15a, we begin with the following configuration: 

• The error bound is 10%; 

• The direction of expansion is away from the origin (/*); 

• New regions are initially of size 128 x 128. 

We observe that smaller and less accurately modeled regions are generated towards the left side 
of the parameter space. Towards the top right corner, the regions become larger and the relative 
error decreases. In this part of the parameter space, we also find several areas which are modeled 
by two or more overlapping regions 7 . 

For the second Modeler configuration, we flip the direction of model expansion (Figure 3.15b). 
We observe the following changes: 

• Especially towards the top right, the generated regions are larger; 

• These regions are of higher accuracy, although the error bound was not modified; 

• Fewer regions overlap. 

The average relative error improves from 10.4% to 6.99% while the number of required sampling 
points decreases from 65,220 to 32,680. We conclude that it is preferable to expand the models 
towards the origin {)/)■ 

In Figure 3.15c we now reduce the error bound to 5%. As a result, the average model error 
improves from 6.98% to 5.79%. This comes at a cost of an increase in the number of samples from 
32, 580 to 53, 550. As in the previous cases, the accuracy of the models decreases as the parameter 
values become smaller; the least accurate models appear for small values of m. 

Finally, we now decrease the size of initial models from 128x 128 to 64x 64 (Figure 3. 15d). While 
this has little influence in the upper right part of the parameter space, lots of small less accurately 
modeled regions are generated for smaller parameter values. This reveals a band of inaccurately 




















































































































































































7 When the model is evaluated at a point covered by multiple regions, the most accurate model is selected. 
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Figure 3.15: ticks model for dtrsm with Model Expansion. 

modeled regions stretching from the top left corner of the parameter space ((m, n) s» (8, 1024)) to 
the middle of its lower bound ((m, n) s» (512, 8)); these artifacts were previously not visible — the 
sampling was too coarse. The finer sampling and the smaller models come at the cost of 84, 710 
additional samples — making a total of 138, 290 samples. Due to the now revealed behavior of 
ticks, the average error increases from 5.79% to 5.96%. 

3.4.2.2 Adaptive Refinement 

The generation and distribution of regions in Adaptive Refinement is governed by two configuration 
options: 

• The relative error bound that decides upon whether a certain region is further refined, and 
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Figure 3.16: ticks model for dtrsm with Adaptive Refinement. 



• The minimum region size which determines the maximum depth of recursive refinements. 

We now study the influence of these options individually. The models' regions resulting from 
varying configurations are shown in Figure 3.16. 

For all configurations, the minimum region size does not allow to generate refined regions at 
the lower end of the parameter space. This is because the minimum value for m and n is 8 and not 
0. The seemingly rectangular regions are parts of larger regions that were, for this reason, only 
partially refined. 

The first model in Figure 3.16a was generated with an error bound of 10% and a minimum region 
size of 64 x 64. The result shows an overall distribution of regions similar to Model Expansion: 
smaller and less accurately modeled regions are predominant for smaller parameter values - 
especially for m. Interestingly, we can already see a structured band of less accurate models, which 
was only visible in the most accurate model generated by Model Expansion (Figure 3.15d). 
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(a) 


10% 


64 


61,880 


7.21% 


(b) 


5% 


64 


81,890 


6.32% 


(c) 


10% 


32 


134, 160 


4.29% 


(d) 


5% 


32 


227, 820 


3.17% 



(b) Adaptive Refinement 



Table 3.1: Comparison of Model Expansion and Adaptive Refinement. 

For our next model, we decrease the error bound to 5%, resulting in the regions shown in 
Figure 3.16b. To attain the higher accuracy, several of the regions in the previous model are 
further refined, especially on the left side. The increased number of regions is covered by 81,890 
samples — 20, 010 more than previously. The higher accuracy requirements lead to a decrease in 
the average error from 7.21% to 6.32%. 

For both values of the error bound (10% and 5%), we now decrease the minimum region size 
to 32 x 32 (Figures 3.16c and 3.16d). This leads to the generation of very many tiny regions. 
The error bound of 10% (5%) leads to an average error of 4.29% (3.17%) at the cost of 134, 160 
(227, 820) samples. 

When we modify the accuracy and region size options, as shown in these results, Adaptive 
Refinement has an attractive property: All samples that were taken for the less accurate models 
can be reused in the generation of the finer models (made possible by the memory file of the 
Modeler's Sampler Interface). A user of the Modeler can take advantage of this property and 
dynamically improve the models if he considers their accuracy to be insufficient. 

3.4.2.3 Comparison 

Table 3.1 reports the number of needed samples and the attained average error for the ticks 
models presented in the previous sections. Figure 3.17 visually compares the number of samples 
and accuracy of these models; we are interested in models that are as accurate as possible from a 
minimal number of samples. 

For fewer samples, Model Expansion generates more accurate models ((b) and (d)). However, 
we have to keep in mind that these models were not representing the fine scale behavior of ticks 
very well. When we are willing to use more samples, Adaptive Refinement generates more accurate 
models ((c)). With huge amount of samples, this PModeler can generate very accurate models 
((d)). 

For the models used to predict the performance of blocked algorithms, we choose Adaptive 
Refinement with configuration (c): 10% error bound and 32 x 32 minimum regions size. This 
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Figure 3.17: Comparison of modeling methods for ticks. 



configuration is a good compromise between the model accuracy and the number of samples. 



Chapter 4 

Prediction and Ranking 



We now have all necessary tools ready to tackle our main goal: ranking blocked algorithms based 
on performance predictions. To predict the performance of a blocked algorithm, we analyze its 
sequence of subroutine invocations. We use performance models generated by the Modeler to 
estimate the performance counters for these invocations. These estimates arc then accumulated, 
resulting in the prediction of the algorithm's performance. The probabilistic nature of the per- 
formance model allows us to give detailed information on the expected ranges of the algorithm's 
performance. 

Like the Modeler, our prediction tool is written in Python. The automatically generated per- 
formance models can therefore be used directly. The structure of the blocked algorithms, on the 
other hand, needs to be represented in a format that we can interpret within Python. 

After we describe the blocked algorithm representation in Section 4.1, we apply our performance 
prediction and ranking to three operations: 

• Inversion of a triangular matrix, L L _1 (Section 4.2); 

• LU decomposition, LU <— A (Section 4.3); 

• Solution of the Sylvester Equation LX + XU = C for X (Section 4.4). 



4.1 Representation of Blocked Algorithms 



To predict the performance of a certain blocked algorithm for a given set of arguments, we need a 
list of all subroutine invocations. In this list, each invocation is represented by a tuple consisting 
of the invoked routine's name and its arguments. To obtain performance estimates, these tuples 
can be passed directly to the performance models. The list of invocations is generated by a Python 
function that mimics the execution of the blocked algorithm. 



Example. We consider the blocked algorithm variant 1 for the inversion of a triangular matrix 
L 4— L^ 1 (Section 1.4.1). The algorithm traverses L from the top left corner 
the following update statements: 



L 



and performs 



Variant 1 



L\o < L xl Liq 
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int t rinvl _ ( char * diag , int* n, double* A, int* IdA , int* blocksize) { 
if (* n = 1) { 

if (diag[0] = 'N') 

*A = 1 / *A; 
return ; 

} 

int ione = 1; 
double one = 1; 
double minusone = — 1; 

int p = ; 

while (p < *n) { 

int b = *blocksize; 

if (p+b>*n) 
b = *n — p ; 

#define A00 (A) 

#define A10 (A + p) 

#define All (A + *ldA * p + p) 

dtrmm_("R", "L" , "N" , diag, &b , &p , tone, A00 , IdA, A10 , IdA); 
dtrsm_("L", "L" , "N" , diag, &b , &p , faninusone , All, IdA, A10 , IdA) 
trinvl__ (diag , &b , All, IdA, feione) ; 

p += b; 

} 

return 0; 



Listing 4.1: Inversion of a triangular matrix — variant 1. 



An implementation of this algorithm in C is given in Listing 4.1. The update L\\ <— L\ x is 
mapped to a recursive routine invocation with block-size 1. This leads to the following recursion 
levels: 

• The main algorithm invocation performs BLAS Level-3 operations on submatrices. 

• At the first level of recursion, the argument blocksize = 1 leads to matrix-vector operation. 
These are performed by the same BLAS Level-3 operations with one of the size parameters 
equal to 1. 

• on the second recursion level, we have n = 1 — the input matrix A consists of a single scalar. 
The inversion of this scalar is handled at the very top of the routine: When diag = N, its 
reciprocal is computed. 

When the algorithm is executed with the arguments 

diag n A IdA blocksize 
trinvK N , 300, A, 300, 100 ), 

that is, for a matrix of size 300 x 300 and a block-size of 100, we obtain the routine invocations listed 
in Table 4.1. The matrix arguments do not need to be specified: Their size can be computed from 
the size and leading dimension arguments and the matrix entries are irrelevant to performance. 
Within Python, the invocation list is represented as follows: 



4.2. TRIANGULAR INVERSE L <- L" 1 
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update subroutine invocation 



Lio 


<— LioLqo 


(dtrmm, R, L, N, N, 100, 0, vl, •, 300, •, 300) 


LlQ 


i L n Lio 


(dtrsm, L, L, N, N, 100, 0, v - 1, •, 300, -,300) 


Ln 




(trinvl, N, 100, -,300,1) 


Lio 


<— LiqLoo 


(dtrmm, R, L, N, N, 100, 100, vl, -,300, -,300) 


Lio 


< L lt Lio 


(dtrsm, L, L, N, N, 100, 100, v - 1, -,300, •, 300) 


Ln 


*~ Lii 


(trinvl, N, 100, -,300, 1) 


Lio 


<— LiqLoo 


(dtrmm, R, L, N, N, 100, 200, vl, -,300, -,300) 


Lio 


< L n Lio 


(dtrsm, L, L, N, N, 100, 200, v - 1, -,300, •, 300) 


Ln 




(trinvl, N, 100, -,300,1) 



Table 4.1: Subroutine invocation of dtrsm (N, 300, A, 300, 100). 



| [ ' dtrmm ' 


'R ' 


'L ' , 


'N', 'N', 100, 


0, ' 


vl ' , 


None , 300 , None , 300] , 


[ ' dtrsm ' , 


'L ' , 


'L ' , 


•N' , 'N' , 100 , 


0, 'v 


-1' , 


None , 300 , None , 300] , 


[ ' trinvl ' 


> N > 


100, 


None, 300, 1] , 








[ ' dtrmm ' , 


'R' , 


'L ' , 


•N' , 'N' , 100 , 


100 , 


'vl ' 


None , 300 , None , 300] , 


[ ' dtrsm ' , 


'L ' , 


'L ' , 


•N' , 'N' , 100 , 


100 , 


'v-1 


, None , 300 , None , 300] , 


[ ' trinvl ' 


'N ' 


100, 


None, 300, 1] , 








[ ' dtrmm ' , 


'R' , 


'L ' , 


'N' , 'N' , 100 , 


200 , 


'vl ' 


None , 300 , None , 300] , 


[ ' dtrsm ' , 


'L ' , 


'L ' , 


'N', 'N', 100, 


200 , 


'v-1 


, None , 300 , None , 300] , 


[ ' trinvl ' 


>N> 


100, 


None, 300, 1]] 









□ 



4.2 Triangular Inverse L ^— L 



The first operation we consider is the inversion of a triangular matrix, L <— L . In Section 1.4.1, 
we presented four blocked algorithms that perform this operation with the following update state- 
ments: 



Variant 1 



Lio LioLqo 

Lio < "n 

Ln <— L tl 



Lt^Liq 




Variant 3 
-Lr > iL 1 1 

J 10 



L21 4 -^21-^X1 

L20 <~ L21L1Q + L20 

Lio "11 
Ln <— L lx 



L-i 1 Lio 



Variant 4 



L21 < L22L21 

L20 < L21L10 + L20 

Lio 4— LiqLqo 
Ln 



Ln 1 



These algorithms' C implementations used for the following performance prediction is given in 
Appendix B.l. 

In Figure 4.1, we present performance measurements of these algorithms with the arguments 



diag n A IdA blocksize 

trinvi ( N , n , A , n , 96 ), 
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Figure 4.1: ticks and efficiency for trinvi (N, n, A, n, 96). 



varying the matrix size n £ {8, 16, ... , 1024}, taking 10 measurements at each point. We show the 
metrics ticks and efficiency, where 

!n 3 + in 2 + |n 

efficiency = — — 

9 2 ■ ticks 

We now turn to the prediction of these measurements ticks, for now, considering the median 
of our models. For our first performance estimate, we use performance models for dtrsm, dtrmm, 
dgemm, and the unblocked versions of the blocked algorithms 1 . These are generated with a cache- 
trashing Sampler and the Modeler configuration selected in Section 3.4.2.3 For each algorithm 
execution, we generate consider the list of subroutine invocations, containing calls to BLAS and the 
algorithms' unblocked versions. We evaluate the performance models for these routine invocations 
and accumulate the obtained estimates, resulting in our performance prediction. 

Since the efficiency graphs for the four algorithms (Figure 4.1b) give a better impression 
of their performance, we present the efficiency computed from our obtained ticks estimates in 
Figure 4.2a. These predictions are already accurate enough to determine that variants 4 ( — ) and 

1 ( ) are less efficient than variants 2 ( ) and 3 ( ). However, in our predictions, variant 

2 ( ) erroneously becomes more efficient than variant 3 ( ) for increasing n; this is not the 

case in the measurements. Furthermore, all efficiency predictions are too low; This results from 
overestimating ticks. This overestimation is due to the memory locality situations used during the 
generation of our performance models; the matrix arguments were placed in main memory. 

For our second set of predictions, we use a performance model generated with an in-cache 
Sampler configuration. The results are shown in Figure 4.2b; a zoom-in on the upper right part 
of the plot (n > 512 and 0.5 < eff iciency < 0.8) is given in Figure 4.2c. Compared to our 
previous predictions, these results show a significantly closer to the measurements. Furthermore, 
all algorithm variants are now ranked correctly. 



1 The models for the unblocked versions are limited to values of size up to 256 — large enough for the unblocked 
algorithm invocations. 
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Figure 4.2: efficiency predictions trinvi (N , n, A, n, 96). 



Up to this point, we presented estimates for the median of the performance counter. In Fig- 
ure 4. 2d we additionally take the quantities average, minimum, and maximum into account. The 
range between minimum and maximum ( ) covers almost all measurements and gives a good 
idea on the expected results; however, they are very broad — they overlap and even include each 

other. The average ( ) is closer to the measured algorithm performance than the previously 

used median. Relying on the average predictions, however, is dangerous, since they are obtained 
for models generated with an error bound on the median and are influenced by outliers. 



Block-size. So far we predicted performance fore varying matrix sizes with highly satisfactory 
results. We now turn to our second aspect of interest: determining the most efficient block-size. 
For this purpose, we now fix the matrix size to 2 n = 1016 and vary the block-size: 



2 We chose 1016, since we observed outliers for the algorithms' performance measurements at 1024 which we are 
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Figure 4.3: efficiency predictions trinvi (N, 1016, A, 1016, blocksize). 



diag n A IdA blocksize 

trinvi ( N , 1016, A, 1016, blocksize). 

The resulting predictions (Figure 4.3) represent the measurements very well for the most efficient 

block-sizes between 48 and 128. Variant 3 ( ) — the fastest — attains its highest performance 

for a block-size of 64 both in our prediction and the measurements. 

The quality of our prediction decreases for very large and very small block-sizes. 

Small block-sizes. The algorithms invoke subroutines very often with long and skinny matrices. 
For such matrices, BLAS routines usually do not attain high performance; this is carried 
forward to the blocked algorithms. Additionally, our models are less accurate for small 
argument values, further decreasing the prediction accuracy. 

Large block-size. The unblocked versions of the algorithm increasingly contribute to the over- 
all performance. Since these unblocked versions are less efficient, the overall performance 
decreases; ticks are overestimated only slightly worse than for block-sizes around 100. 

For our goal — determining the fastest algorithm configuration — the low accuracy in regions with 
low performance is not a major problem. 

In the predictions of the maximum and average ticks for variant 4 ( ), we further observe 
severe inaccuracies for block-sizes between 8 and 160. These are due to outliers in the measurements 
that the corresponding models were constructed upon. These outliers do not carry forward to the 
minimum and the median of ticks. 



4.3 LU Decomposition LU ^— A 

The second operation we study is the LU decomposition LU = A of a square matrix A. It is very 
commonly used to solve linear systems AX = B: With LU = A, we can write this as LUX = B 



not able to explain. 
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and apply dtrsm (solution of a triangular system) twice to obtain X = U~ 1 L~ 1 B. 

There are five blocked algorithms for the LU decomposition. They take A as an input and 
compute L and U in place: The unit-diagonal lower triangular L is stored in the strictly lower 
triangular part of A, whereas the non-unit-diagonal U is stored in the upper triangular part of A. 
Within the scheme of the blocked algorithms (Section 1.4), the five algorithmic variants perform 
the following updates: 

Variant 3 

^01 

An <- An - A W A 01 
A u i-LU(A U ) 
A 2 i «- A 2 i - A 20 A Q1 
A 21 <- AnCUn)- 1 



Variant 1 

Zoi <- (^Aoo)"^! 

A w <- Aio(^Aoo)- 1 
An *- An - AiqAqi 
A U <-LU(A U ) 



Variant 2 
Aio <- Aio(^Aoo)" 1 
An <- An - AioAoi 
A u <-LU(A n ) 

A12 <- A12 - Ai A 02 

A12 <- (KAii) _1 Ai 2 



Variant 4 



An «- An - A10A01 
An LU(A 11 ) 
A12 <- A12 - Ai A 02 
A 12 <- (NAn)- 1 ^ 
A21 <- A 2 i - A 20 A i 
A 2 i <- A 2 i(^An)- 1 



Variant 5 


An * 


- LUiAu) 


A12 + 


- (\SA 11 )- 1 A 12 


A 2 i + 


- A 2 i(^An)- 1 


A22 <■ 


- A 2 2 — A 2 \A\ 2 



Here. I\A and ^A denote the (strictly) lower and upper triangular part of A, respectively; LU (A) 
denotes the LU decomposition itself — an recursive application of the algorithm with block-size 
1 to a submatrix. Our C implementation of these algorithms is given in Appendix B.2; then- 
signatures are lui (n, A, IdA, blocksize) with i g {1,2,3,4}. 

We use the same method applied to L L^ 1 in the previous section to estimate the perfor- 
mance of these algorithms with the arguments 

n A IdA 

n , 



lui (n , A 



blocksize 
96 ), 



where a € {8, 16, ... , 1024}. The results of these predictions and measurements of the correspond- 
ing algorithm executions are shown in Figure 4.4, where efficiency is given by 



eff iciency 



|n 3 



in 2 
2 11 



2ticks 



Since it is almost impossible to distinguish the performance of the five variants in the ticks plot, 
we focus on efficiency (Figures 4.4b and 4.4c). We observe that for the most part our predictions 
are fairly close to the measurements. Up ton« 900, they can be used to rank the five algorithms 
correctly according to their performance. 

However, the performance of variant 5 ( ) — the fastest — is predicted with decreasing 

accuracy for larger values of n; the predictions are generally too low and diverge from our mea- 
surements. This leads so far, that the algorithms are incorrectly ranked for problems larger than 
n = 900. While the other variants are also slightly underestimated, their predictions are more 
stable and accurate. 



4.4 Sylvester Equation: Solving LX + XU = C for X 

We now study of a more complicated operation: the solution of the Sylvester equation. This 
operation is encountered in control theory and is generally of the form AX + XB = C, where 
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Figure 4.4: efficiency predictions for lui (n, A, n, 96). 



A £ R mxm , B e R nxn , and C £ E mx ™ are given and X £ K mx ™ is to be computed. We consider 
a special case of this equation, where A and B are lower and upper triangular, respectively: 
LX + XU = C. 



With ClIck [15, 16], a tool for the automatic generation of blocked algorithms, we found 16 
algorithmic variants for this operation. Each of them takes three matrices L, U, and X] X initially 
contains the input matrix C and is overwritten with the solution X by the algorithms. 



The update statements for the 16 algorithms are given below. Within these, Q(L, U, X) denotes 
a recursive invocation of the Sylvester equation solver for a smaller matrix X. The C implemen- 
tation of the algorithms is given in Appendix B.3; their signature is sylvi (m, n, L, IdL, U, 
ldU, X, ldX, blocksize) with % £ {1, . . . , 16}. 



SYLVESTER EQUATION: SOLVING LX + XU = C FOR X 



Variant 1 

-^01 4— Xqi — XqqUqi 

Xio <— Xiq — Lio^"oo 
Xoi 4— Sl(Loo,Uii,X i) 
Xio 4- Q(L\i, Uoo, Xio) 
Xn 4— Xn — XiqUqi 
Xn 4- Xn — L w X m 
Xn 4- Q(Lu, Uu,Xu) 



Variant 2 
Xqi 4- X i — X 00 U i 
Xio 4— Q{Li\, Uqo, Xio) 
Xqi 4— fi(Loo,Un, Xoi) 
Xn 4— Xn — X w Uqi 

X20 4— X20 — L21X10 

Xn 4- Xn — L10V01 
Xn 4— I7(in, Uii,Xn) 

X21 4- X21 — L21V11 
X21 4- X21 — 1/20^01 



Variant 3 
-^01 4- X i — V00C/01 
Xn 4— Vn — Xiot/01 
V21 4- V21 — V20C/01 
V i 4- i7(i 0j t^n , X i) 
Xn 4— Vn — iioVoi 
Xn 4- Q(Li\, Uu, Xn) 
X21 4- X 2 i — L 2 1^11 
V21 4- V21 — £20-^01 
X2I 4- J7(L22, ?/n, V21) 



Variant 4 
-V01 4- V01 — -Voo^oi 

Vl2 4- V12 — Vio?7o2 

V01 4— il(Loo, Un, Xoi) 
Xn 4— Xn — L10V01 
Xn<- n(Lii,Un,Xn) 

X21 4— X 2 i — L 2 iXn 
V21 4— X21 — ^20^01 
V21 4— fl(L 2 2, Un, X21) 
X22 4— ^22 — X 2 iUi2 



Variant 5 
V01 4- J?(Loo, C7"ii , -V01) 
V10 4- V10 — L10V00 

X02 4— V02 — ^01^12 

V10 4- Q[L\\, Uoo, Xio) 

Xn 4- Xn — X w Uqi 

Xn 4— Xn — Li V i 
Xn 4- n(Lii,Uii,Xn) 

X12 4- V12 — XuUyi 
X\2 4— V12 — XioUq 2 



Variant 6 



V i 4- Q(Loo, Un, Xoi) 
Xio 4— Q{Ln,Uoo,Xio) 



X02 


4- V02 — 


^01^12 


Xn 


4-X n - 


-^10^01 


X20 


4- V20 — 


^2\Xio 


Xn 


<-Xn- 


-^10^01 


Xn 


4- n(L u 


,Vn,X 


X12 


<-x 12 - 


XnUyi 


X21 


4- X21- 


-^21-^11 


X12 


4- V12 — 


-^10^02 


X21 


4- X21- 


-^20-^01 



Variant 7 


^01 


4- n(L 00 


, C^n, -^01) 


Xn 


<-X n - 


-^10^01 


-^21 


4- ^21 — 


-^20^01 


X02 


4- X02 — 


-^01^12 


Xn 


4-Xn- 


£10-^01 


Xn 


4- n(L n 


, U\i,X\i) 


X12 


4-X12- 


X11U12 


X21 


4- V21 — 


-^21^11 


X12 


<-Xi 2 - 


-^10^02 


X21 


4- V21- 


-^20-^01 


X21 


4- fl(L 2 2 


, ^11,^21) 



Variant 8 



V01 4— Q{Loo, Uu, Xoi) 

X02 4— V02 — -^01^12 

Vn 4- Xn — Li V i 
Xn 4- n(Lii,Uii,Xn) 

X12 4- V12 — XnUyi 
V 2 i 4— X21 — L 2 lXu 
V21 4— V21 — L20V01 
X2I 4~ n(L22,Un,X2l) 
X22 4— X 2 2 — ^21^12 



Variant 9 



Xio 4- V10 — 
V10 4— Q{Ln 
Xn 4- X\\ — 
Xn 4- Xn — 
Xn 4- Q(L U 

X12 4- V12 — 
X12 4- V12 — 
-^12 4— Xi 2 ~ 
X12 4— f}(Ln 



LiqXoo 

, Uoo, Xio) 
-^10^01 
iio-^01 
, Uu, Xn) 

-^11^12 
-^10^02 
iio-^02 

, U 2 2,Xi 2 ) 
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Variant 10 
Xio Xio — LiqXoq 

X2I X2I — Lj2qXqi 

Xiq <— il(Lu, Uoo, Xio) 
Xn ■<— Xn — Xio?7oi 
X 11 ^Q(L 11 ,U 11 ,X 11 ) 

X\2 X\2 — X\\U\2 

X\2 <— X\2 — Xi^Um 
X\2 <— &(Lii, U22, X12) 

X22 ^~ X22 — L>2\X\2 



Variant 11 



X\§ 


G(T , 1 




X^ 




XiqUqI 


X2§ 


^- X20 — 


-^21-^10 


X n 


^Xn~ 




X n 


<- Q{Ln 


,Un,X 


X\2 


^x 12 - 


X\\U\2 


X2I 


<-X 21 - 


^2\X\\ 


X\2 


<— X12 — 


X\{)Uti2 


X2I 


•S— X21 — 


^20 -^01 


X\2 


<— X12 — 


^10^02 


X\2 


<- f2(Ln 


, U22 7 x 



Variant 12 
Xio <— f}(Ln,Uoo,Xio) 
Xn <~ Xn — Xiof/01 
X20 X2Q — L21X1Q 
Xn <- f?(iii,C/ii,Xu) 
X12 X12 — XnU\2 
X21 <— X21 — L 2 \Xn 

X\2 <— X\2 — X\§U{)2 

X\2 <— Q{Ln,U22,Xi2) 
X22 *~ X22 — L%\X\2 



Variant 13 


Xn 


*-Xu- 


XiqUqi 


X21 


<-x n - 


^20^01 


Xn 




-^lO^Ol 


Xn 


<- f2(Lu 


i Un,Xn) 


X\2 


^x 12 - 


X\\U\2 


X2I 


^x 2l - 


L>2\Xn 


X\2 


^x l2 - 


X\§U§2 


X2I 


^X 2 i- 


-^20^01 


X\2 


^x 12 - 


LiqXq2 


X2I 


<- fl(L 2 2 


, C^n, -^21) 


X\2 


<- f2(Ln 


, U22, X12) 



Variant 14 



Xn ^— Xn — X10E/01 

X21 <— X21 — X20E/0I 

Xn <- fl(£n,l7ii,*ii) 
X12 X12 — XnU\2 
X21 ^- X21 — L 2 \Xn 

X\2 X12 — X 10 [/ 2 

X21 &{L22,Un, X21) 
X\2 <— Q(Ln, U22, X12) 

X22 ^~ X22 — Ij2\X\2 



Variant 15 



X\\ ^~ Xn — 

Xn <- 0{L lu U lu Xn) 

X\2 4— X\2 — XnU\2 

X21 X21 — L 2 iXn 

X\2 <— X12 ~ L 10 Xq2 

X21 <— X 2 i — L 2 qX i 
X\2 <— ^(Ln,U22, X12) 
X21 <— ^(£22, Un, X21) 
X22 ^ X22 — X21UX2 



Variant 16 



Xn <- Q{L ll ,U ll ,X ll ) 
X\2 <— X\2 — XnU\2 
X21 <— X21 — L 2 iXn 
X\2 <— [2(Ln, U22-,X\2) 
X21 ^— Q\Jj22i Un, X21) 

X22 4— X22 — X2\U\2 
X22 4— X22 — L>2lX\2 

These algorithms differ in a few ways from those in the previous sections: 

• They operate on three matrices, overwriting one of them with the output. 

• The input matrices are of different sizes: L G W xm , U £ M nxn , and X £ W xn . The matrices 
are traversed along the diagonal as far as possible and then along the remaining dimension 
(see Section 1.4.2). 

• There are three recursive calls Q in each step of the matrix traversal. These operate not only 
on the Xn € M blocksizexblocksi2e but also on the matrix panels X ou X w , X 12 , and X 2 \. For 
the latter, our C implementation invokes the blocked algorithms recursively; only the small 
matrices Xn trigger their unblocked versions. 



For our performance predictions, we focus on the case 

m n 

sylvi (n , n , 



IdL 

n , 



u 



ldU 

n , 



ldX 

n , 



blocksize 
96 ). 
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(c) efficiency measurements (d) efficiency prediction 
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Figure 4.5: efficiency predictions for sylvi (n, n, L, n, U, n, X, n, 96). 



All matrices are of size nxn, n € {8, 16, ... , 1024} and we use the block-size 96. Figure 4.5 compares 
our predictions for these algorithms with corresponding measurements of their implementations, 
where 

ff ■ n3 + n2 
err tctencu = . 

3 2ticks 

We observe significantly different performances across algorithms: At n = 1024 variant 1 (— •— ) is 
20 times faster than variant 13 (- >-). In the ticks plots, we see that our predictions separate the 
fast and slow variants very well. To rank the top candidates, we turn to efficiency; here, we see 
that the prediction quality is not as good as in our previous studies. However, the algorithms arc 
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Chapter 5 

Conclusion and Future Work 



In this thesis, we have introduced methods and tools to analyze and model the performance of 
dense linear algebra routines. Our goal was to rank sets of blocked algorithms according to their 
performance without executing them. Towards this goal, we created a set of automatic performance 
analysis and modeling tools. The Sampler measures the performance of routine executions; based 
on this tool, the Modeler generates performance models for specified routines. With these models, 
we were able to accurately predict the performance of blocked algorithms, rank them correctly, 
and determine the optimal algorithmic block-size. 

Sampler. As a base for our performance analysis, we designed a performance measurement tool: 
the Sampler (Section 2.3). This flexible tool can sample any dense linear algebra routine execution 
on any system. Provided with library or object files for a set of routines and according header files, 
its build system automatically generates an executable that allows to measure the performance of 
these routines. 

The cycle-accurate time stamp counter (RDTSC) serves as the base of its execution time measure- 
ments ticks. The Performance Application Programming Interface (PAPI) is optionally 1 used to 
access a wide range of other performance counters, such as flops or Limisses. These performance 
counters and further properties of the Sampler, such as memory handling policies, are specified 
through a simple configuration file (Section 2.3.2.1). 

The Sampler can be interfaced with other programs through its standard input and output 
streams. On its input stream, it expects a series of sampling requests, consisting of routine names 
and argument values; on the output stream, it provides performance measurements for these routine 
invocations. 

Thanks to its flexibility, functionalities, and simple interface, the Sampler serves as an ideal 
base for performance analysis in dense linear algebra. Apart form performance modeling, it can 
be used for tasks such as performance debugging or tuning. 

Modeler. Based on the Sampler, we constructed an automatic performance modeling tool: the 
Modeler (Section 3.3). For dense linear algebra routines, this tool builds models that represent the 
routines' performance as a function of a set of discrete (e.g., side, transA) and continuous (e.g., 
m, n) arguments. The generated models yield probabilistic performance estimates in the form of 
statistical quantities, such as minimum, median, or average. 

The Modeler, written in Python, is highly flexible: It provides a configuration system to specify 
a wide range of settings. These include but are not limited to the following: 

1 requiring a kernel patch. 



Gl 
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• The choice of the Sampler instance and its configuration; 

• The routines to be modeled; 

• The treatment of the routine arguments; 

• The performance counters to be modeled; 

• The type of polynomial modeling to be used; 

• The accuracy and resolution of the polynomial models. 

The Modeler aims at generating highly accurate performance models, while at the same time 
using as few measurements as possible. This poses a trade-off: reducing the number of samples 
generally decreases the model accuracy and vice versa. The Modeler configuration allows to balance 
these two aspects according to the desires of the user (see Section 3.4.2). 

Ranking. With the models automatically generated by the Modeler, we can predict the per- 
formance of blocked algorithms execution-less (Chapter 4). Our accurate predictions allow us 
to 

• Rank several algorithms according to their performance, and 

• Determine the optimal algorithmic block-size. 

5.1 Outlook 

The work presented in this thesis offers a number of possible directions for further research. 

Prediction of Other Performance Metrics. In our study of blocked algorithms, we focused 
on predicting the execution time ticks — one of many performance metrics. Our models, however, 
can describe a large variety of performance counters, all of which can be included in the prediction 
process. Of particular interest might be the cache miss counters Limisses through Ljinisses, since 
these are closely related to the processor's energy consumption. 

Other Algorithm Types. We limited ourselves to the performance prediction of blocked al- 
gorithms. These, however, are only one type of algorithms used in dense linear algebra. Other 
algorithm types include algorithms-by-block and recursive algorithms. Like blocked algorithms, 
these are based on BLAS Level-3 routines; our models for these routines can be used in the per- 
formance prediction of these algorithms. 

Shared Memory Parallelism. We were concerned with analyzing the performance of single 
core algorithms. It is crucial to understand the phenomena and behaviors that appear in this 
scenario before we target parallel algorithms. Since in the field of high performance computing 
CPUs usually provide several cores, considering shared memory parallelism is a natural next step. 

For blocked algorithms, parallelism can be exploited through multi-threaded BLAS implemen- 
tations. A possible approach would be to apply our measurement and modeling framework to these 
parallel BLAS routines and rank the algorithms accordingly. 
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Extrapolation. In this thesis, we build performance models for routines that represent a certain 
range of size arguments. We then use these models to predict the performance of blocked algo- 
rithms within the same range. In principle, the polynomials of our models can be extrapolated to 
estimate the performance of routine executions beyond their scope. Designed for a limited range 
of argument values, the estimation quality of our models will deteriorate when we extrapolate. 
However, extrapolation is an interesting usage scenario of performance models and performance 
modeling in general. 
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Appendix A 

Introduction to BLAS 



In this appendix, we discuss the Basic Linear Algebra Subprograms (BLAS) and their usage. BLAS 
as a low level interface specification, which can be accessed directly from within C or Fortran. 
However, virtually all dense linear algebra libraries in any other programming language is built on 
top of BLAS. 

BLAS is separated into three levels: 

• BLAS Level- 1 routines implement vector operations (e.g., scaling or addition of vectors); 

• BLAS Level-2 routines implement matrix- vector operations (e.g., matrix- vector multiplica- 
tion or solution of a triangular linear system); 

• BLAS Level-3 routines implement matrix-matrix operations (e.g., matrix-matrix multiplica- 
tion or solution of a triangular linear system with multiple right hand sides). 

Both BLAS Level-1 and BLAS Level-2 operations are memory bound; they cannot possibly reach 
the theoretical peak performance of a CPU. Only BLAS Level-3 operations are compute bound; 
carefully implemented, they can reach up to 99% of the peak performance. 

We focus on discussing the following properties of the interfaces for C and Fortran: 

• The routine names (Section A.l); 

• The routine arguments and matrix storage (Section A. 2); 

• Routine invocation and results (Section A. 3); 

• Available implementation (Section A. 4). 

Throughout this appendix, we consider dtrsm as an exemplary BLAS routine. It computes the 
solution of a triangular linear system with multiple right hand sides, B <— aA~ 1 B. 

A.l Routine names 

Due to restrictions in early versions of Fortran, BLAS routine names consist of a maximum of 6 
characters. They are chosen such that they represent both the operation and the data types. 

We now study the routine name dtrsm and show why it corresponds to the operation B 
aA~ l B. Character my character, dtrsm has the following meaning: 



G9 
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• d: The first character identifies the data type of all scalar, vector, and matrix arguments. 
Every BLAS routine is available for the following data types: 

— s: single precision real; 

— d: double precision real; 

— c: single precision complex (real and imaginary part are stored as two consecutive single 
precision numbers); 

— z: double precision complex (two consecutive doubles). 

• tr: These two characters indicate that a certain matrix is involved. There are again four 
possibilities: 

— ge: general matrix; 

— sy: symmetric matrix; 

— he: hermetian matrix; 

— tr: triangular matrix. 

• s: This character describes the type of operation: 

— m: multiplication; 

— s: solution of a linear system. 

• m: The last character indicates the type of the second operand: 

— v: vector; 

— m: matrix. 

Together, dtrsm stands for: 

double precision triangular linear system with multiple right hand sides (B <— aA~ 1 B). 
d tr s m 

Most BLAS Level-2 and Level-3 routines follow this naming scheme; BLAS Level-1 opera- 
tions use other similarly representative routine names. Examples of other routine names and the 
corresponding operations are the following: 

• zgemv: double precision complex general matrix vector product (y <— a Ax + j3y). 

z ge m v 

• csyrk: single precision complex symmetric rank k update (C aAA T + f3C). 

c sy rk 



saxpy: single precision scalar times vector plus vector (y 4- ax + y). 



s a x 



Since Fortran appends an underscore _ to all function names in its object files, this underscore 
needs to be appended to all BLAS routines, when they are used in C (e.g., dtrsm is available as 
dtrsm_). 
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A. 2 Arguments and Matrix Storage 

All arguments of BLAS routines are passed by reference, that is, in the form of pointers. This 
again is necessary for compatibility with Fortran, where arguments are always passed by reference. 

For each BLAS routine, we distinguish three groups of arguments: discrete, size, and data 
arguments. For example, dtrsm has the following signature: 

a _A_ _B_ 
dtrsm (side, uplo, transA, diag, m, n, alpha, A, IdA, B, ldB) . 

discrete size 




scalar matrix leading 
dimension 



• The discrete arguments side, uplo, transA, and diag specify the exact operation: 

— side € {Left, Right} identify on which side of B A appears: B «— olA~ x B or B f- 
aBA- 1 . 

— uplo € {Lowertriangular, Uppertriangular} specifies whether A is lower or upper 
triangular. In operations involving symmetric matrices, uplo indicates, which part of 
the matrix is stored in memory — the lower or upper triangular part. 

— transA 6 {Notranspose, Transpose indicates if the system is solved with A or A T . For 
complex valued routines, Transpose is replaced by Complex-conjugate. 

— diag £ {Non — Unittriangular, Unittriangular declares whether the entries of the 
diagonal of A are to be treated as ones (Unit triangular) independent of the data 
stored in memory. Using this argument, it is possible to store both an upper triangular 
and a lower triangular matrix in the memory of a full matrix, when one of them is unit 
triangular. 

The values of discrete arguments are identified by their first character (e.g., side = L for 
Left). All 16 value combinations for the four discrete argument in dtrsm are allowed. 

• The size arguments m and n specify the sizes of the matrix (and vector) operands. For dtrsm, 
we have B G M mxn ; depending on the value of side, A is of size m x m (L) or n x n (R). 

• The data arguments alpha, A, IdA, B, and ldB determine the storage locations of the operands: 

— The scalar argument alpha is (a pointer to) the value of a. 

— The matrix arguments A and B are (pointers to) these matrices in memory. As in Fortran, 
matrices are expected to be stored in column-major format : the columns of the matrix 
are continuous in memory; matrices consist of a set of consecutive columns. 

— The leading dimension arguments IdA and ldB, immediately following the corresponding 
matrix arguments A and B, give the distance between two adjacent matrix entries in the 
same row. Figure A.l shows how this can be used to address a matrix as part of a larger 
matrix. 

For vector-operations there are two further argument types: 

— Vector arguments (e.g., x) are (pointers to) vector in memory. 
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Figure A.l: A S M. mxn as part of an M x N matrix. 



— Increment arguments (e.g., incx) immediately follow the vector arguments and specify 
the distance between two consecutive elements of a vector in memory. These arguments 
allow to access both columns and rows of matrices as vectors. 

The order of data arguments corresponds to the order of the corresponding operands in 
the basic form of the operation; for dtrsm, alpha. A. IdA. B, and ldB are in the order of 
B <- aA- l B. 



A. 3 Routine Invocation and Results 

Most BLAS routines do not have an explicit return values, like functions do. Instead, one of the 
routine arguments is overwritten with the result of the operation — usually the last one. Exceptions 
are BLAS routines that compute scalar quantities; for example, the inner product of two vectors 
ddot returns x T y. 



A. 4 Implementations 

Since BLAS was introduced in 1979 [20], numerous implementations were developed. In this 
section, we will list the most prominent implementations for x86 processors and compare their 
efficiencies. 



• The reference implementation [2] is a minimal implementation of BLAS. Its purpose is to serve 
as a reference for both BLAS developers and users. The routines are well documented and 
easy to read and understand. Since this implementation was not designed with performance 
in mind, it is not suitable for scientific codes due to its poor performance. 

• GotoBLAS2 [18, 3] is an open source implementation developed at the Texas Advanced 
Computing Center (University of Texas at Austin); its name stems from its initiator and 
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Figure A. 2: Performance of dgemm in different BLAS implementations. 



former developer Kazushige Goto. It is written in C and assembly and contains highly 
optimized kernels for many CPU architectures. 

• The Automatically Tuned Linear Algebra Software (ATLAS) [23, 24, 1] is another open source 
implementation. During its building process, this library optimizes itself for the CPU archi- 
tecture; for this, it uses empirical techniques to estimate the best configuration of its routines. 

• Intel's commercial Math Kernel Library (MKL) [5] offers (among other functionality) a high 
performance implementation of BLAS for Intel architectures. 

We now compare the performance of these BLAS implementations by considering dgemm (C 
aAB + j3C). This routine is usually the most optimized routine in any high performance library 
(In many implementations, other BLAS Level-3 routines are based on dgemm kernels [17]). We 
use the Sampler (Section 2.3) on an Intel Harpertown E5450 processor [6] running at 2.99GHz to 
measure the ticks (execution time) of dgemm with the following parameters: 



transA transB 

dgemm ( N N 



k alpha 
k , 0.5 , 



Id A 

m , 



B 



ldB 



beta C 
0.5, C, 



ldC 

m ), 



that is, C <- 0.5AB + 0.5C with A,B,C € 
of these executions, where 



efficiency 



In Figure A. 2 we show the ticks and efficiency 

n 3 + 2n 2 



2ticks 

As expected, the reference implementation ( ) attains the poorest performance. For matrices 
that fit in the 12MB cache of the processor, it reaches 28% of the peak performance; for larger 
matrices, the performance drops rapidly to 7%. The high performance implementations Goto- 
BLAS2 (•), MKL (•), and ATLAS (•) attain efficiencies that exceed the reference implementation 
by more than a factor of 10. With an efficiencies around 96%, the hand optimized GotoBLAS2 (•) 
and MKL (•) outperform the automatically tuned ATLAS (•), which reaches 87%. 
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Appendix B 

Implementations of Blocked 
Algorithms 



This appendix lists the implementations of the blocked algorithms used throughout the thesis. All 
algorithms are written in C and need to linked with a BLAS library. 

We implemented only the blocked version of each algorithm. The unblocked version is implicitly 
given by passing blocksize = 1. 



B.l Triangular Inverse L ^— L 



Listing B.l: trinv. c — inversion of a triangular matrix. 



extern void dtrmm_ ( char * , char*, char*, char*, int*, int*, double*, double*, int*, 
double*, int*); 

extern void dtrsm_ ( char * , char*, char*, char*, int*, int*, double*, double*, int*, 
double*, int*); 

extern void dgemm_ ( char * , char*, int*, int*, int*, double*, double*, int*, double*, 
int*, double*, double*, int*); 

int trinvl (char* diag , int* n, double* A, int* IdA , int* blocksize) { 
if (* n = 1) { 

if (diag [0] = 'N') 

*A = 1 / *A; 
return 0; 

} 

int ione = 1; 
double one = 1; 
double minusone = — 1; 

int p = 0; 

while (p < *n) { 

int b = *blocksize; 

if (p+b>*n) 
b = *n — p ; 

int r = *n — (p + b) ; 



// A00 

// A10 All 

// A20 A21 A22 
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#define AOO (A) 

#define A10 (A + p) 

#define All (A + *ldA * p + p) 

#define A20 (A + p + b) 

#define A21 (A + *ldA * p + p + b) 

#define A22 (A + *ldA * (p + b) + p + b) 

// A10 = A10 * AOO 

dtrmm_("R", "L" , "N" , diag , &b , &p , tone, AOO, IdA , A10, IdA) ; 

// A10 = -All \ A10 

dtrsm_("L", "L" , "N" , diag, &b , &p , feminusone , All, IdA, A10 , IdA); 

// All = 1 / All 

trinvl__ (diag , &b, All, IdA, feione) ; 

p += b; 

} 

return 0; 

} 

int trinv2_ ( char* diag, int* n, double* A, int* IdA, int* blocksize) { 
if (* n = 1) { 

if (diag [0] = 'N') 

*A = 1 / *A; 
return 0; 

} 

int ione = 1; 
double one = 1; 
double minusone = — 1; 

int p = ; 

while (p < *n) { 

int b = *blocksize; 

if (p+b>*n) 
b = *n — p ; 

int r=*n— (p+b); 

// AOO 
// A10 All 
// A20 A21 A22 
#define AOO (A) 

#define A10 (A + p) 

#define All (A + *ldA * p + p) 

#define A20 (A + p + b) 

#define A21 (A + *ldA * p + p + b) 

#define A22 (A + *ldA * (p + b) + p + b) 

// A21 = A22 \ A21 

dtrsm_("L", "L" , "N" , diag, &r , &b , tone, A22 , IdA, A21 , IdA); 

// A21 = -A21 / All 

dtrsm_("R", "L" , "N" , diag, &r , &b , feminusone , All, IdA, A21 , IdA); 

// All = 1 / All 

trinv2__ (diag , &b , All, IdA, feione); 

p += b; 

} 

return 0; 

} 

int trinv3_ ( char* diag, int* n, double* A, int* IdA, int* blocksize) { 
if (* n = 1) { 

if (diag [0] = 'N') 
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*A = 1 / *A; 
return 0; 

} 

int ione = 1; 
double one = 1; 
double minusone = — 1; 

int p = 0; 

while (p < *n) { 

int b = *blocksize; 

if (p+b>*n) 
b = *n — p ; 

int r = *n — (p + b) ; 

// A00 
// A10 All 
// A20 A21 A22 
#define A00 (A) 

#define A10 (A + p) 

#define All (A + *ldA * p + p) 

#define A20 (A + p + b) 

#define A21 (A + *ldA * p + p + b) 

#define A22 (A + *ldA * (p + b) + p + b) 

// A21 = -A21 / All 

dtrsm_("R", "L" , "N" , diag , &r , &b , faninusone , All, IdA , A21 , IdA); 

// A20 = A21 * A10 + A20 

dgemm_("N", "N" , fer , &p , &b , tone, A21 , IdA, A10 , IdA, tone, A20 , IdA) 

// A10 = All \ A10 

dtrsm_("L", "L" , "N" , diag, &b , &p , tone, All, IdA, A10, IdA); 

// All = 1 / All 

trinv3 _ ( diag , &b , All, IdA, feione ) ; 

p += b; 

} 

return 0; 

} 

int trinv4_ ( char* diag, int* n, double* A, int* IdA, int* blocksize) { 
if (* n = 1) { 

if (diag[0] = 'N') 

*A = 1 / *A; 
return ; 

} 

int ione = 1; 
double one = 1; 
double minusone = — 1; 

int p = ; 

while (p < *n) { 

int b = *blocksize; 

if (p+b>*n) 
b = *n — p ; 

int r = *n — (p + b) ; 

// A00 
// A10 All 
// A20 A21 A22 
#define A00 (A) 

#define A10 (A + p) 
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#define All (A + *ldA * p + p) 

#define A20 (A + p + b) 

#define A21 (A + *ldA * p + p + b) 

#define A22 (A + *ldA * (p + b) + p + b) 

// A21 = -A22 \ A21 

dtrsm_("L", »L" , "N" , diag , &r , &b , faninusone , A22 , IdA , A21 , IdA); 

// A20 = -A21 * A10 + A20 

dgemm_("N", "N" , &r , &p , &b , feminusone , A21 , IdA, A10 , IdA, tone, A20 , IdA) 

// A10 = A10 * A00 

dtrmm_("R", "L" , "N" , diag, to , &p , tone, A00 , IdA, A10, IdA); 

// All = 1 / All 

trinv4__ ( diag , &b, All, IdA, feione) ; 

p += b; 

} 

return 0; 



B.2 LU Decomposition LU <— A 



Listing B.2: lu.c — LU decomposition. 



extern 


void dtrmm (char 


* , 


char * , 


char 


*, char*, int*, int*, 


double * , 


double 


* , int * , 


double*, int *) ; 
















extern 


void dtrsm (char 


* , 


char * , 


char 


*, char*, int*, int*, 


double * , 


double 


* , int * , 


double*, int*); 
















extern 


void dgemm (char 


* , 


char * , 


int * 


, int*, int*, double* 


double * 


int * , 


double * , 


int*, double*, double 


* , int * 


); 










int lul 


(int * n , double 


* A, int* 


IdA , 


int * blocksize ) { 








if " 


(*n = 1) 


















return 0; 
















int 


ione = 1; 
















double one = 1; 
















double minusone = — 


i; 














int 


p = 0; 
















while (p < *n) { 


















int b = *blocksize 
















if (p+b>*n) 


















b = *n — p ; 


















int r = *n — (p 


+ 


b); 














// A00 A01 A02 


















// A10 All A12 


















// A20 A21 A22 
















#define 


A00 (A) 
















#define 


A01 (A + *ldA * 


p) 














#define 


A02 (A + *ldA * 


(p 


+ b)) 












#define 


A10 (A 




-f 


P) 










#define 


All (A + *ldA * 


p 


4 


P) 










#define 


A12 (A + *ldA * 


(p 


+ b) 4 


P) 










#define 


A20 (A 




4 


P + 


b) 








#define 


A21 (A + *ldA * 


p 


4 


P + 


b) 








#define 


A22 (A + *ldA * 


(p 


+ b) 4 


P + 


b) 
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// AOl = trilu( A00 ) \ A01 

dtrsm_("L", "L" , "N" , "U" , to, to, tone , A00 , IdA , AOl, IdA); 
// A10 = AlO / triu ( AOO ) 

dtrsm_("R", "U" , "N" , "N" , to , to , tone, AOO, IdA, AlO, IdA); 

// All = All - AlO * AOl 

dgemm_("N", "N" , to, to, to, toninusone , AlO, IdA, AOl, IdA, tone, All, IdA) 

// All = LU(All) 

lul_(&b, All, IdA, tione); 



} 



P += b; 

} 

return 0; 



int lu2_(int* n, double* A, int* IdA, int* blocksize) { 
if (* n = 1) 
return ; 

int ione = 1; 
double one = 1; 
double minusone = — 1; 

int p = 0; 

while (p < *n) { 

int b = *blocksize; 

if (p+b>*n) 
b = *n — p ; 

int r = *n — (p + b) ; 

// AOO AOl A02 

// AlO All A12 

// A20 A21 A22 
#define AOO (A) 
#define AOl (A + *ldA * p) 
#define A02 (A + *ldA * (p + b)) 
#define AlO (A + p) 

#define All (A + *ldA * p + p) 

#define A12 (A + *ldA * (p + b) + p) 
#define A20 (A + p + b) 

#define A21 (A + *ldA * p + p + b) 

#define A22 (A + *ldA * (p + b) + p + b) 

// AlO = AlO / triu ( AOO ) 

dtrsm_("R", "U" , "N" , "N" , to , to , tone, AOO, IdA, AlO, IdA); 

// All = All - AlO * AOl 

dgemm_("N", "N" , to, to, to, faninusone , AlO, IdA, AOl, IdA, tone, All, IdA) 

// All = LU(All) 

lu2_(&b, All, IdA, tione); 

// A12 = A12 - AlO * A02 

dgemm_("N", "N" , to, to, to, toninusone, AlO, IdA, A02 , IdA, tone, A12 , IdA) 

// A12 = trilu( All ) \ A12 

dtrsm_("L", "L" , "N" , "U" , to, to, tone, All, IdA, A12 , IdA); 

p += b; 

} 

return 0; 

} 

int lu3_(int* n, double* A, int* IdA, int* blocksize) { 
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100 
101 
102 
103 
104 
105 
106 
107 
108 
109 
110 
111 
112 
113 
114 
115 
116 
117 
118 
119 
120 
121 
122 
123 

124 
125 
126 
127 

128 
129 
130 
131 
132 
133 
134 
135 
136 
137 
138 
139 
140 
141 
142 
143 
144 
145 
146 
147 
148 
149 
150 
151 



if (*n = 1) 

return 0; 

int ione = 1; 
double one = 1; 
double minusone = —1; 

int p = ; 

while (p < *n) { 

int b = *blocksize; 

if (p+b>*n) 
b = *n — p ; 

int r = *n — (p + b) ; 

// A00 A01 A02 

// A10 All A12 

// A20 A21 A22 
#define A00 (A) 
#define A01 (A + *ldA * p) 
#define A02 (A + *ldA * (p + b)) 
#define A10 (A + p) 

#define All (A + *ldA * p + p) 

#define A12 (A + *ldA * (p + b) + p) 
#define A20 (A + p + b) 

#define A21 (A + *ldA * p + p + b) 

#define A22 (A + *ldA * (p + b) + p + b) 

// A01 = trilu( A00 ) \ A01 

dtrsm_("L", "L" , "N" , "U" , to, to, tone, AOO , IdA , A01 , IdA); 

// All = All - A10 * A01 

dgemm_("N", "N" , to, to, to, toiinusone , A10, IdA, A01 , IdA, tone, All, IdA) 

// All = LU(All) 

lu3_(&b, All, IdA, tione); 

// A21 = A21 - A20 * A01 

dgemm_("N", "N" , &r , to, to , toiinusone , A20 , IdA, A01 , IdA, tone, A21 , IdA) 

// A21 = A21 / triu ( All ) 

dtrsm_("R", "U" , "N" , "N" , to, to, tone, All, IdA, A21 , IdA); 

p += b; 

} 

return 0; 

} 

int lu4_(int* n, double* A, int* IdA, int* blocksize) { 
if (* n = 1) 
return 0; 

int ione = 1; 
double one = 1; 
double minusone = —1; 

int p = ; 

while (p < *n) { 

int b = *blocksize; 

if (p+b>*n) 
b = *n — p ; 

int r = *n — (p + b) ; 



// AOO A01 A02 



B.2. LU DECOMPOSITION LU <- A 
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#de 
#de 
#de 
#de 
#de 
#de 
#de 
#de 
#de 



// A10 All A12 

// A20 A21 A22 
fine AOO (A) 
fine A01 (A + *ldA * p) 
fine A02 (A + *ldA * (p + b)) 
fine AlO (A + p) 

fine All (A + *ldA * p + p) 

fine A12 (A + *ldA * (p + b) + p) 
fine A20 (A + p - 

fine A21 (A + *ldA * p + p - 

fine A22 (A + *ldA 



b) 
b) 



(p + b) + p + b) 
// All = All - AlO * AOl 

dgemm_("N", "N" , to, &b, to, toninusone , AlO, IdA , AOl, IdA , tone, All, IdA) 

// All = LU(All) 
lu4_(&b, All, IdA, tione) 



168 




// A12 = A12 - 


AlO * 


A02 










169 




dgemm ( "N" , "N" 


, to, 


&r , 


&p , toninusone , AlO , 


IdA , 


A02, 


IdA , tone , 


170 




// A12 = trilu ( 


All 


) \ A12 








171 




dtrsm ( "L" , "L" 


, "N" 


, "U" 


, to, to , tone , All , 


IdA , 


A12, 


IdA) ; 


172 




// A21 = A21 - 


A20 * 


AOl 










173 




dgemm_ ( "N" , "N" 


, &r, 


to, 


&p , feminusone , A20 , 


IdA , 


AOl , 


IdA , tone , 


174 




// A21 = A21 / 


t r i u ( 


All 


) 








175 




dtrsm_ ( "R" , "U" 


, "N" 


, "N" 


, &r , to, tone , All , 


IdA , 


A21 , 


IdA) ; 


176 


















177 




P += b; 














178 


} 
















179 


return 0; 














180 


} 
















181 


















182 


int lu5 


(int * n , double 


* A, 


int * 


IdA, int* blocksize) 


{ 






183 


if " 


(*n = 1) 














184 




return 0; 














185 


















186 


int 


ione = 1; 














187 


double one = 1; 














188 


double minusone = - 


-i; 












189 


















190 


int 


p = 0; 














191 


while (p < *n) { 














192 




int b = *blocksize; 












193 




if (p+b>*n) 














194 




b = *n — p ; 














195 




int r=*n— (p 


+ b); 










196 


















197 




// AOO AOl A02 














198 




// AlO All A12 














199 




// A20 A21 A22 














200 


#define 


AOO (A) 














201 


#define 


AOl (A + *ldA * 


P) 












202 


#define 


A02 (A + *ldA * 


(P + 


b)) 










203 


#define 


AlO (A 




4 


P) 








204 


#define 


All (A + *ldA * 


P 


4 


P) 








205 


#define 


A12 (A + *ldA * 


(P + 


b) 4 


P) 








206 


#define 


A20 (A 




4 


P + b) 








207 


#define 


A21 (A + *ldA * 


P 


4 


P + b) 








208 


#define 


A22 (A + *ldA * 


(P + 


b) 4 


P + b) 








209 


















210 




// All = LU(All) 
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lu5_(&b, All, IdA, &ione); 

// A12 = trilu ( All ) \ A12 

dtrsm_("L", "L" , "N" , "U" , &b, &r , tone , All, IdA, A12 , IdA); 
// A21 = A21 / triu ( All ) 

dtrsm_("R", "U" , "N" , "N" , &r , to, tone, All, IdA, A21 , IdA); 

// A22 = A22 - A21 * A12 

dgemm_("N", "N" , &r , &r , to, faninusone , A21 , IdA, A12 , IdA, tone, A22 , IdA) 



P += b; 

} 

return 0; 



B.3 Sylvester Equation: Solving LX + XU = C for X 

Since the file containing all 16 variants of the Sylvester Equation solver spans 1,127 lines, we only 
give the full code up to and including the first variant. For variants 2 through 16, we list only the 
update statements; the surrounding part of the algorithms is, up to the name, identical to the first 
variant. 



Listing B.3: sylv.c (extract up to first algorithm) — Solution of Sylvester Equation. 



1 


extern 


void 


igemm (char*, char*, int*, 


int*, int*, double*, double*, int*, double*, 




int*, double*, double*, 


int*) ; 




2 
3 




// LOO L01 L02 






4 




// L10 Lll L12 






5 




// L20 L21 L22 






6 


#define 


LOOm 


Lp 






7 


#define 


LOlm 


Lp 






8 


#define 


L02m 


Lp 






9 


#define 


LlOm 


Lb 






10 


#define 


Lllm 


Lb 






11 


#define 


L12m 


Lb 






12 


#define 


L20m 


Lr 






13 


#define 


L21m 


Lr 






14 


#define 


L22m 


Lr 






15 


#define 


LOOn 


Lp 






16 


#define 


LOln 


Lb 






17 


#define 


L02n 


Lr 






18 


#define 


LlOn 


Lp 






19 


#define 


Llln 


Lb 






20 


#define 


L12n 


Lr 






21 


#define 


L20n 


Lp 






22 


#define 


L21n 


Lb 






23 


#define 


L22n 


Lr 






24 


#define 


LOO 


(L) 






25 


#define 


L01 


(L + *ldL * Lp) 






26 


#define 


L02 


(L + *ldL * (Lp 


f Lb)) 




27 


#define 


L10 


(L 


+ Lp) 




28 


#define 


Lll 


(L + *ldL * Lp 


+ Lp) 




29 


^define 


L12 


(L + *ldL * (Lp 


f Lb) + Lp) 




30 


#define 


L20 


(L 


+ Lp 


■f Lb) 


31 


#define 


L21 


(L + *ldL * Lp 


+ Lp 


+ Lb) 


32 


#define 


L22 


(L + *ldL * (Lp 


f Lb) + Lp 


f Lb) 


33 
34 




// U00 U01 U02 






35 




// U10 Ull U12 
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S3 





// U20 U21 U22 










#define 


UOOm Up 












#define 


UOlm Up 












#define 


U02m Up 












#define 


UlOm Ub 












#define 


Ullm Ub 












#define 


U12m Ub 












#define 


U20m Ur 












#define 


U21m Ur 












#define 


U22m Ur 












#define 


UOOn Up 












#define 


UOln Ub 












#define 


U02n Ur 












#define 


UlOn Up 












#define 


Ulln Ub 












#define 


U12n Ur 












#define 


U20n Un 












-44- n a f l ri o 
-fj- U C 1 1 11 C 


U21n Ub 












#define 


U22n Ur 












#define 


U00 (U) 












#define 


U01 (U - 


- *ldU * 


Up) 








#define 


U02 (U - 


- *ldU * 


(Up 


+ Ub)) 






#define 


U10 (U 








f- Up) 




#define 


Ull (U - 


- *ldU * 


Up 




L Up) 




#define 


U12 (U - 


- *ldU * 


(Up 


+ Ub) - 


h Up) 




#define 


U20 (U 








\- Up 4 


Ub) 


#define 


U21 (U - 


- *ldU * 


Up 




h Up 4 


Ub) 


#define 


U22 (U - 


- *ldU * 


(Up 


+ Ub) - 


L Up 4 


Ub) 




// X00 X01 X02 












// X10 Xll X12 












// X20 X21 X22 










#define 


XOOm Lp 












#define 


XOlm Lp 












#define 


X02m Lp 












#define 


XlOm Lb 












#define 


Xllm Lb 












#define 


X12m Lb 












#define 


X20m Lr 












#define 


X21m Lr 












#define 


X22m Lr 












#define 


XOOn Up 












#define 


XOln Ub 












#define 


X02n Ur 












#define 


XI On Up 












#define 


XI In Ub 












#define 


X12n Ur 












#define 


X20n Up 












#define 


X21n Ub 












#define 


X22n Ur 












#define 


XOO (X) 












#define 


XOl (X - 


- *ldX * 


Up) 








#define 


X02 (X - 


- *ldX * 


(Up 


+ Ub)) 






#define 


XIO (X 








h Lp) 




#define 


Xll (X - 


- *ldX * 


Up 




L Lp) 




#define 


X12 (X - 


- *ldX * 


(Up 


+ Ub) - 


L Lp) 




#define 


X20 (X 








F Lp 4 


Lb) 


#define 


X21 (X - 


- *ldX * 


Up 




F Lp 4 


Lb) 


#define 


X22 (X - 


- *ldX * 


(Up 


+ Ub) - 


L Lp 4 


Lb) 


int sylv 


1 (int* 


m, int * 


n, double* 


L, int* IdL , double* U, int* ldU , double* X, int 
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* ldX , int* blocksize) { 
#undef sylv_ 
#define sylv_ sylvl_ 
i f (*m * *n = 0) 

return 0; 
if ( *m * *n = 1 ) { 
*X /= *L + *U; 
return 0; 

} 

int ione = 1; 
double one = 1; 
double mone = —1; 

int p = ; 

int b = *blocksize; 

if ( b >= *m && b >= * n ) 

b = 1; 

while (p < *n || p< *m) { 

int Lp = p; 
int Lb = b; 
i f ( Lp >= *m) { 

Lp = *m; 

Lb = 0; 
} else 

i f ( Lp + Lb > *m) 
Lb = *m — Lp ; 
int Lr = *m — (Lp + Lb) ; 

i n t Up = p ; 
int Ub = b; 
if (Up >= *n) { 

Up = * n ; 

Ub = 0; 
} else 

if (Up + Ub > *n) 
Ub = * n — Up ; 
int Ur = *n - (Up + Ub) ; 

// X01 = -X00 U01 + X01 

dgemm_("N", "N" , &X01m, &X01n , &X00n , toione , X00 , ldX , U01 , ldU , tone, X01 , 
ldX) ; 

// X10 = -L10 X00 + X10 

dgemm_("N", "N" , &X10m, &X10n , &L10n , toione , L10 , IdL , X00 , ldX , tone, X10 , 
ldX) ; 

// X01 = sylv(L00, Ull, X01) 

sylv_(&X01m, &X01n, LOO, IdL, Ull, ldU , X01 , ldX , blocksize); 

// X10 = sylv(Lll, U00, X10) 

sylv_(&X10m, &X10n, Lll , IdL, U00 , ldU , X10, ldX , blocksize); 

// Xll = -X10 U01 + Xll 

dgemm_("N", "N" , &Xllm, &Xlln, &X10n, toione, X10 , ldX , U01 , ldU , tone, Xll, 
ldX) ; 

// Xll = -L10 X01 + Xll 

dgemm_("N", "N" , &Xllm, &Xlln, &L10n , toione, L10 , IdL, X01 , ldX , tone, Xll, 
ldX) ; 

// Xll = sylv(Lll, Ull, Xll) 

sylv_(&Xllm, &Xlln, Lll, IdL, Ull, ldU , Xll, ldX , blocksize); 

p += b; 

} 

return 0; 



B.3. SYLVESTER EQUATION: SOLVING LX + XU = C FOR X 
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155 

Listing B.4: sylv.c (2 nd algorithm updates) — Solution of Sylvester Equation. 



197 
198 
199 
200 
201 
202 
203 
20d 
205 
206 
207 
208 
209 
210 
211 
212 
213 
211 



// X01 = -X00 U01 + X01 

dgemm_("N", "N" , &X01m, faXOln, faXOOn , femone, X00 , ldX , U01 , ldU , fame , X01 , ldX) 

// X10 = sylv(Lll, UOO, X10) 

sylv_(faX10m, faXlOn , Lll , IdL , UOO, ldU , X10 , ldX , blocksize); 

// X01 = sylv(LOO, Ull, X01) 

sylv_(faX01m, faXOln, LOO, IdL, Ull, ldU , X01 , ldX , blocksize); 

// Xll = -X10 U01 + Xll 

dgemm_( "N" , "N" , &Xllm, faXlln, faXlOn , fanone , X10 , ldX , U01 , ldU , fane, Xll, ldX) 

// X20 = -L21 X10 + X20 

dgemm_( "N" , "N" , &X20m, faX20n , &L21n , fanone, L21 , IdL, X10 , ldX , fane, X20 , ldX) 

// Xll = -L10 X01 + Xll 

dgemm_("N", "N" , &Xllm, faXlln, &L10n , fanone, L10 , IdL, X01 , ldX , fane, Xll, ldX) 

// Xll = sylv(Lll, Ull, Xll) 

sylv_(faXllm, faXlln, Lll, IdL, Ull, ldU , Xll, ldX , blocksize); 

// X21 = -L21 Xll + X21 

dgemm_( "N" , "N" , &X21m, faX21n , &L21n , fanone, L21 , IdL, Xll, ldX , fane, X21 , ldX) 

// X21 = -L20 X01 + X21 

dgemm_("N", "N" , &X21m, faX21n , &L20n , fanone, L20 , IdL, X01 , ldX , fane, X21 , ldX) 



201 
202 
263 
264 
265 
266 
267 
268 
269 
2 70 
271 
272 
273 
274 
275 
276 
277 
278 



Listing B.5: sylv. c (3 algorithm updates) — Solution of Sylvester Equation. 

// XOl = -XOO UOl + X01 

dgemm_( "N" , "N" , &X01m, faXOln , faXOOn , fanone, XOO, ldX , UOl, ldU , fane, XOl, ldX) ; 

// Xll = -X10 UOl + Xll 

dgemm_("N", "N" , faXllm, faXlln, faXlOn , fanone, X10 , ldX , UOl, ldU , fane, Xll, ldX) ; 

// X21 = -X20 UOl + X21 

dgemm_( "N" , "N" , &X21m, faX21n , &X20n , fanone, X20 , ldX , UOl, ldU , fane, X21 , ldX) ; 

// XOl = sylv (LOO, Ull, XOl) 

sylv_(faX01m, faXOln, LOO, IdL, Ull, ldU , XOl, ldX , blocksize); 

// Xll = -L10 XOl + Xll 

dgemm_( "N" , "N" , &Xllm, faXlln, &L10n , fanone, L10 , IdL, XOl, ldX , fane, Xll, ldX) ; 

// Xll = sylv (Lll, Ull, Xll) 

sylv_(faXllm, faXlln, Lll, IdL, Ull, ldU , Xll, ldX , blocksize); 

// X21 = -L21 Xll + X21 

dgemm_( "N" , "N" , &X21m, faX21n , &L21n , fanone, L21 , IdL, Xll, ldX , fane, X21 , ldX) ; 

// X21 = -L20 XOl + X21 

dgemm_("N", "N" , &X21m, faX21n , &L20n , fanone, L20 , IdL, XOl, ldX , fane, X21 , ldX) ; 

// X21 = sylv(L22, Ull, X21) 

sylv_(faX21m, &X21n, L22 , IdL, Ull, ldU , X21 , ldX , blocksize); 



Listing B.6: sylv. c (4 algorithm updates) — Solution of Sylvester Equation. 



325 
326 
327 
328 
329 
330 
331 
332 
333 
334 
335 
336 
337 
338 



// XOl = —XOO UOl + XOl 

dgemm_( "N" , "N" , faXOlm, faXOln , faXOOn , fanone, XOO, ldX , UOl, ldU , fane, XOl, ldX) 

// X12 = -X10 U02 + X12 

dgemm_( "N" , "N" , &X12m, faX12n , faXlOn , fanone, X10 , ldX , U02 , ldU , fane, X12 , ldX) 

// XOl = sylv (LOO, Ull, XOl) 

sylv_(faX01m, faXOln, LOO, IdL, Ull, ldU , XOl, ldX , blocksize); 

// Xll = -L10 XOl + Xll 

dgemm_("N", "N" , &Xllm, faXlln, &L10n , fanone, L10 , IdL, XOl, ldX , fane, Xll, ldX) 

// Xll = sylv(Lll, Ull, Xll) 

sylv_(faXllm, faXlln, Lll, IdL, Ull, ldU , Xll, ldX , blocksize); 

// X21 = -L21 Xll + X21 

dgemm_( "N" , "N" , &X21m, faX21n , &L21n , fanone, L21 , IdL, Xll, ldX , fane, X21 , ldX) 

// X21 = -L20 XOl + X21 

dgemm_("N", "N" , &X21m, faX21n , &L20n , fanone, L20 , IdL, XOl, ldX , fane, X21 , ldX) 
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339 // X21 = sylv(L22, Ull , X21) 

340 sylv_(faX21m, faX21n , L22 , IdL , Ull, ldU , X21 , ldX , blocksize); 

341 // X22 = -X21 U12 + X22 

342 dgemm_( "N" , "N" , &X22m, &X22n , faX21n , femone, X21 , ldX , U12 , ldU , tone , X22 , ldX) ; 



Listing B.7: sylv. c (5 algorithm updates) — Solution of Sylvester Equation. 



389 
390 
391 
392 
393 
394 
395 
396 
397 
398 
399 
400 
401 
402 
403 
404 
405 
406 



// X01 = sylv (LOO, Ull, X01) 

sylv_(faX01m, faXOln, LOO, IdL, Ull, ldU , X01 , ldX , blocksize); 

// X10 = -L10 X00 + X10 

dgemm_( "N" , "N" , &X10m, faXlOn , &L10n , femone, L10 , IdL, X00 , ldX , tone, X10 , ldX) ; 

// X02 = -X01 U12 + X02 

dgemm_("N", "N" , &X02m, faX02n , faXOln, fanone, X01 , ldX , U12 , ldU , tone, X02 , ldX) ; 

// X10 = sylv(Lll, U00, X10) 

sylv_(faX10m, faXlOn , Lll , IdL, UOO , ldU , X10 , ldX , blocksize); 

// Xll = -X10 U01 + Xll 

dgemm_("N", "N" , &Xllm, faXlln, faXlOn , fanone , X10 , ldX , U01 , ldU , tone, Xll, ldX) ; 

// Xll = -L10 X01 + Xll 

dgemm_("N", "N" , &Xllm, faXlln, &L10n , fanone, L10 , IdL, X01 , ldX , tone, Xll, ldX) ; 

// Xll = sylv (Lll, Ull, Xll) 

sylv_(faXllm, faXlln, Lll, IdL, Ull, ldU , Xll, ldX , blocksize); 

// X12 = -Xll U12 + X12 

dgemm_( "N" , "N" , &X12m, faX12n , faXlln, femone, Xll, ldX , U12 , ldU , tone, X12 , ldX) ; 

// X12 = -X10 U02 + X12 

dgemm_("N", "N" , &X12m, &X12n , faXlOn , fanone, X10 , ldX , U02 , ldU , tone, X12 , ldX) ; 



Listing B.8: sylv.c (6 algorithm updates) — Solution of Sylvester Equation. 



453 
454 
455 
456 
457 
458 
459 
460 
461 
462 
463 
464 
465 
466 
467 
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471 
472 
473 
474 



521 
522 
523 
524 
525 
526 
527 



// X01 = sylv (LOO, Ull, X01) 

sylv_(faX01m, faXOln, LOO, IdL, Ull, ldU , X01 , ldX , blocksize) 

// X10 = sylv(Lll, UOO, X10) 

sylv_(faX10m, faXlOn , Lll, IdL, UOO, ldU , X10 , ldX , blocksize) 

// X02 = -X01 U12 + X02 

dgemm_("N", "N" , &X02m, faX02n , faXOln , fanone, X01 , Id 

// Xll = -X10 U01 + Xll 

dgemm_("N", "N" , &Xllm, feXlln , faXlOn , fanone, X10 , Id 

// X20 = -L21 X10 + X20 

dgemm_("N", "N" , &X20m, &X20n , &L21n , fanone, L21 , Id 

// Xll = -L10 X01 + Xll 

dgemm_("N", "N" , &Xllm, faXlln, &L10n , fanone, L10 , Id 

// Xll = sylv (Lll, Ull, Xll) 

sylv_(faXllm, faXlln, Lll, IdL, Ull, ldU , Xll, ldX , blocksize) 

// X12 = -Xll U12 + X12 

dgemm_("N", "N" , &X12m, faX12n , faXlln, fanone, Xll, Id 

// X21 = -L21 Xll + X21 

dgemm_("N", "N" , &X21m, faX21n , &L21n , fanone, L21 , Id 

// X12 = -X10 U02 + X12 

dgemm_("N", "N" , &X12m, faX12n , faXlOn , fanone, X10 , Id 

// X21 = -L20 X01 + X21 



U12, 


ldU , 


tone , 


X02, 


ldX) ; 


U01 , 


ldU , 


tone , 


Xll , 


ldX) ; 


X10, 


ldX , 


tone , 


X20, 


ldX) ; 


X01 , 


ldX , 


tone , 


Xll , 


ldX) ; 



U12, 


ldU , 


tone , 


X12, 


ldX) ; 


Xll , 


ldX , 


tone , 


X21 , 


ldX) ; 


U02, 


ldU , 


tone , 


X12, 


ldX) ; 


X01 , 


ldX , 


tone , 


X21 , 


ldX) ; 



Listing B.9: sylv.c (7 algorithm updates) — Solution of Sylvester Equation. 



// XOl = sylv (LOO, Ull, X01) 

sylv_(faX01m, faXOln, LOO, IdL, Ull, ldU , XOl, ldX , blocksize); 

// Xll = -X10 U01 + Xll 

dgemm_("N", "N" , &Xllm, faXlln, faXlOn , fanone, X10 , ldX , U01 , ldU , tone, Xll, ldX) ; 

// X21 = -X20 U01 + X21 

dgemm_("N", "N" , &X21m, faX21n , faX20n , fanone, X20 , ldX , U01 , ldU , tone, X21 , ldX) ; 

// X02 = -XOl U12 + X02 



B.3. SYLVESTER EQUATION: SOLVING LX + XU = 



C FOR X 



87 



528 
529 
530 
531 
532 
533 
534 
535 
536 
537 
538 
539 
540 
541 
542 



dgemm_( "N" , "N" , &X02m, &X02n , &X01n , femone, X01 , ldX , U12 , ldU , fame , X02 , ldX) 

// Xll = -L10 X01 + Xll 

dgemm_( "N" , "N" , &Xllm, faXlln, &L10n , fanone , L10 , IdL , X01 , ldX , fane, Xll, ldX) 

// Xll = sylv(Lll, Ull, Xll) 

sylv_(faXllm, &Xlln, Lll , IdL, Ull, ldU , Xll, ldX , blocksize); 

// X12 = -Xll U12 + X12 

dgemm_( "N" , "N" , &X12m, &X12n , &Xlln, fanone, Xll, ldX , U12 , ldU , fane, X12 , ldX) 

// X21 = -L21 Xll + X21 

dgemm_( "N" , "N" , &X21m, &X21n , &L21n , fanone, L21 , IdL, Xll, ldX , fane, X21 , ldX) 

// X12 = -X10 U02 + X12 

dgemm_( "N" , "N" , &X12m, &X12n , &X10n , fanone, X10 , ldX , U02 , ldU , fane, X12 , ldX) 

// X21 = -L20 X01 + X21 

dgemm_("N", "N" , &X21m, &X21n , &L20n , fanone, L20 , IdL, X01 , ldX , fane, X21 , ldX) 

// X21 = sylv(L22, Ull, X21) 

sylv_(faX21m, &X21n, L22 , IdL, Ull, ldU , X21 , ldX , blocksize); 



Listing B.10: sylv.c (8 algorithm updates) — Solution of Sylvester Equation. 



589 
590 
591 
592 
593 
594 
595 
596 
597 
598 
599 
600 
601 
602 
603 
604 
605 
606 



// X01 = sylv(L00, Ull, X01) 

sylv_(&X01m, &X01n, LOO, IdL, Ull, ldU , X01 , ldX , blocksize); 

// X02 = -X01 U12 + X02 

dgemm_( "N" , "N" , &X02m, &X02n , &X01n , fanone, X01 , ldX , U12 , ldU , fane, X02 , ldX) ; 

// Xll = -L10 X01 + Xll 

dgemm_("N", "N" , &Xllm, faXlln, &L10n , fanone, L10 , IdL, X01 , ldX , fane, Xll, ldX) ; 

// Xll = sylv(Lll, Ull, Xll) 

sylv_(faXllm, &Xlln, Lll, IdL, Ull, ldU , Xll, ldX , blocksize); 

// X12 = -Xll U12 + X12 

dgemm_( "N" , "N" , &X12m, &X12n , &Xlln, fanone, Xll, ldX , U12 , ldU , fane, X12 , ldX) ; 

// X21 = -L21 Xll + X21 

dgemm_( "N" , "N" , &X21m, &X21n , &L21n , fanone, L21 , IdL, Xll, ldX , fane, X21 , ldX) ; 

// X21 = -L20 X01 + X21 

dgemm_("N", "N" , &X21m, &X21n , &L20n , fanone, L20 , IdL, X01 , ldX , fane, X21 , ldX) ; 

// X21 = sylv(L22, Ull, X21) 

sylv_(faX21m, &X21n, L22 , IdL, Ull, ldU , X21 , ldX , blocksize); 

// X22 = -X21 U12 + X22 

dgemm_( "N" , "N" , &X22m, &X22n , &X21n , fanone, X21 , ldX , U12 , ldU , fane, X22 , ldX) ; 



Listing B.ll: sylv.c (9 algorithm updates) — Solution of Sylvester Equation. 



653 
654 
655 
656 
657 
658 
659 
660 
661 
662 
663 
664 
665 



// X01 = sylv(L00, Ull, X01) 

sylv_(faX01m, &X01n, LOO, IdL, Ull, ldU , X01 , ldX , blocksize); 

// X02 = -X01 U12 + X02 

dgemm_( "N" , "N" , &X02m, &X02n , &X01n , fanone, X01 , ldX , U12 , ldU , fane, X02 , ldX) 

// Xll = -L10 X01 + Xll 

dgemm_("N", "N" , &Xllm, faXlln, &L10n , fanone, L10 , IdL, X01 , ldX , fane, Xll, ldX) 

// Xll = sylv(Lll, Ull, Xll) 

sylv_(faXllm, &Xlln, Lll, IdL, Ull, ldU , Xll, ldX , blocksize); 

// X12 = -Xll U12 + X12 

dgemm_( "N" , "N" , &X12m, &X12n , &Xlln, fanone, Xll, ldX , U12 , ldU , fane, X12 , ldX) 

// X21 = -L21 Xll + X21 

dgemm_( "N" , "N" , &X21m, &X21n , &L21n , fanone, L21 , IdL, Xll, ldX , fane, X21 , ldX) 

// X21 = -L20 X01 + X21 

dgemm_( "N" , "N" , &X21m, &X21n , &L20n , fanone, L20 , IdL, X01 , ldX , fane, X21 , ldX) 

// X21 = sylv(L22, Ull, X21) 

sylv_(faX21m, &X21n, L22 , IdL, Ull, ldU , X21 , ldX , blocksize); 

// X22 = -X21 U12 + X22 

dgemm_( "N" , "N" , &X22m, &X22n , &X21n , fanone, X21 , ldX , U12 , ldU , fane, X22 , ldX) 



Listing B.12: sylv.c (10 th algorithm updates) — Solution of Sylvester Equation. 
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719 
720 
721 
722 
723 
724 
725 
726 
727 
728 
729 
730 
731 
732 
733 
734 



// X10 = -L10 XOO + X10 

dgemm_( "N" , "N" , &X10m, faXlOn , &L10n , fanone, L10 , IdL , XOO, ldX , fcone, X10 , ldX) ; 

// X21 = -L20 X01 + X21 

dgemm_("N", "N" , &X21m, faX21n , &L20n , fanone, L20 , IdL, X01 , ldX , tone , X21 , ldX) ; 

// X10 = sylv(Lll, UOO, X10) 

sylv_(faX10m, faXlOn , Lll , IdL, UOO, ldU , X10 , ldX , blocksize); 

// Xll = -X10 U01 + Xll 

dgemm_("N", "N" , &Xllm, faXlln, faXlOn , fanone, X10 , ldX , U01 , ldU , fane, Xll, ldX) ; 

// Xll = sylv(Lll, Ull, Xll) 

sylv_(faXllm, faXlln, Lll, IdL, Ull, ldU , Xll, ldX , blocksize); 

// X12 = -Xll U12 + X12 

dgemm_("N", "N" , &X12m, faX12n , faXlln, fanone, Xll, ldX , U12 , ldU , fane, X12 , ldX) ; 

// X12 = -X10 U02 + X12 

dgemm_( "N" , "N" , &X12m, faX12n , faXlOn , fanone, X10 , ldX , U02 , ldU , fane, X12 , ldX) ; 

// X12 = sylv(Lll, U22, X12) 

sylv_(faX12m, faX12n , Lll, IdL, U22 , ldU , X12 , ldX , blocksize); 

// X22 = -L21 X12 + X22 

dgemm_("N", "N" , &X22m, faX22n , &L21n , fanone, L21 , IdL, X12 , ldX , fane, X22 , ldX) ; 



Listing B.13: sylv.c (11 algorithm updates) — Solution of Sylvester Equation. 



781 
782 
783 
784 
785 
786 
787 
788 
789 
790 
791 
792 
793 
794 
795 
796 
797 
798 



// X10 = -L10 XOO + X10 

dgemm_( "N" , "N" , &X10m, faXlOn , &L10n , fanone, L10 , IdL, XOO, ldX , fane, X10 , ldX) ; 

// X21 = -L20 X01 + X21 

dgemm_( "N" , "N" , &X21m, faX21n , &L20n , fanone, L20 , IdL, X01 , ldX , fane, X21 , ldX) ; 

// X10 = sylv(Lll, UOO, X10) 

sylv_(faX10m, faXlOn , Lll, IdL, UOO, ldU , X10 , ldX , blocksize); 

// Xll = -X10 U01 + Xll 

dgemm_( "N" , "N" , &Xllm, faXlln, faXlOn , fanone, X10 , ldX , U01 , ldU , fane, Xll, ldX) ; 

// Xll = sylv(Lll, Ull, Xll) 

sylv_(faXllm, faXlln, Lll, IdL, Ull, ldU , Xll, ldX , blocksize); 

// X12 = -Xll U12 + X12 

dgemm_( "N" , "N" , &X12m, &X12n , faXlln, fanone, Xll, ldX , U12 , ldU , fane, X12 , ldX) ; 

// X12 = -X10 U02 + X12 

dgemm_("N", "N" , &X12m, faX12n , faXlOn , fanone, X10 , ldX , U02 , ldU , fane, X12 , ldX) ; 

// X12 = sylv(Lll, U22, X12) 

sylv_(faX12m, faX12n , Lll, IdL, U22 , ldU , X12 , ldX , blocksize); 

// X22 = -L21 X12 + X22 

dgemm_("N", "N" , &X22m, faX22n , &L21n , fanone, L21 , IdL, X12 , ldX , fane, X22 , ldX) ; 



Listing B.14: sylv.c (12 algorithm updates) — Solution of Sylvester Equation. 



850 
851 



854 
855 



857 
858 



861 
862 
863 
864 
865 



// X10 = -L10 XOO + X10 

dgemm_( "N" , "N" , &X10m, faXlOn , &L10n , fanone, L10 , IdL, XOO, ldX , fane, X10 , ldX) 

// X21 = -L20 X01 + X21 

dgemm_("N", "N" , &X21m, faX21n , &L20n , fanone, L20 , IdL, X01 , ldX , fane, X21 , ldX) 

// X10 = sylv(Lll, UOO, X10) 

sylv_(faX10m, faXlOn , Lll, IdL, UOO, ldU , X10 , ldX , blocksize); 

// Xll = -X10 U01 + Xll 

dgemm_( "N" , "N" , &Xllm, faXlln, faXlOn , fanone, X10 , ldX , U01 , ldU , fane, Xll, ldX) 

// Xll = sylv(Lll, Ull, Xll) 

sylv_(faXllm, faXlln, Lll, IdL, Ull, ldU , Xll, ldX , blocksize); 

// X12 = -Xll U12 + X12 

dgemm_( "N" , "N" , &X12m, &X12n , faXlln, fanone, Xll, ldX , U12 , ldU , fane, X12 , ldX) 

// X12 = -X10 U02 + X12 

dgemm_("N", "N" , &X12m, faX12n , faXlOn , fanone, X10 , ldX , U02 , ldU , fane, X12 , ldX) 

// X12 = sylv(Lll, U22, X12) 

sylv_(faX12m, faX12n , Lll, IdL, U22 , ldU , X12 , ldX , blocksize); 

// X22 = -L21 X12 + X22 

dgemm_("N", "N" , &X22m, faX22n , &L21n , fanone, L21 , IdL, X12 , ldX , fane, X22 , ldX) 
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Listing B.15: sylv.c (13 th algorithm updates) — Solution of Sylvester Equation. 



913 
914 
915 
916 
917 
918 
919 
020 
921 
922 
923 
924 
925 
926 
927 
928 
929 
930 
931 
932 
933 
934 



// Xll = -X10 U01 + Xll 
dgemm_("N", "N" , tXllm, tXlli 
// X21 = -X20 U01 + X21 
dgemm_( "N" , "N" , &X21m, tX21i 
// Xll = -L10 XOl + Xll 
dgemm_("N", "N" , &Xllm, &XII1 
// Xll = sylv(Lll, Ull, Xll) 
sylv_(tXllm, tXlln, Lll , IdL 
// X12 = -Xll U12 + X12 
dgemm_("N", "N" , &X12m, tX12i 
// X21 = -L21 Xll + X21 
dgemm_( "N" , "N" , tX21m, &X21i 
// X12 = -X10 U02 + X12 
dgemm_("N", "N" , &X12m, &X12i 
// X21 = -L20 XOl + X21 
dgemm_("N", "N" , tX21m, &X21i 
// X12 = -L10 X02 + X12 
dgemm_( "N" , "N" , tX12m, tX12i 
// X21 = sylv(L22, Ull, X21) 
sylv_(tX21m, tX21n, L22 , IdL 
// X12 = sylv(Lll, U22, X12) 



Ssmone , 


X10, 


ldX , U01 , ldU , 


tone , 


Xll , 


ldX) ; 


tmone , 


X20, 


ldX , U01 , ldU , 


tone , 


X21 , 


ldX) ; 


femone , 


L10 , 


IdL , XOl , ldX , 


tone , 


Xll , 


ldX) ; 


> Xll, 


ldX , 


blocksize ) ; 








fcmone , 


Xll , 


ldX , U12 , ldU , 


tone , 


X12, 


ldX) ; 


femone , 


L21 , 


IdL, Xll, ldX, 


tone , 


X21 , 


ldX) ; 


tmone , 


X10, 


ldX , U02 , ldU , 


tone , 


X12, 


ldX) ; 


femone , 


L20 , 


IdL , XOl , ldX , 


tone , 


X21 , 


ldX) ; 


femone , 


L10 , 


IdL , X02 , ldX , 


tone , 


X12, 


ldX) ; 


, X21, 


ldX , 


blocksize ) ; 








, X12, 


ldX , 


blocksize ) ; 









Listing B.16: sylv.c (14 th algorithm updates) — Solution of Sylvester Equation. 



981 
982 
983 
984 
985 
986 
987 
988 
989 
990 
991 
992 
993 
994 



997 
998 



// Xll = -X10 U01 + Xll 

dgemm_("N", "N" , tXllm, tXlln, tXlOn , tmone, X10 , ldX , U01 , ldU , tone, Xll, ldX) 

// X21 = -X20 U01 + X21 

dgemm_("N", "N" , tX21m, &X21n , tX20n , tmone, X20 , ldX , U01 , ldU , tone, X21 , ldX) 

// Xll = sylv(Lll, Ull, Xll) 

sylv_(tXllm, tXlln, Lll, IdL, Ull, ldU , Xll, ldX , blocksize); 

// X12 = -Xll U12 + X12 

dgemm_( "N" , "N" , tX12m, tX12n , tXlln, tmone, Xll, ldX , U12 , ldU , tone, X12 , ldX) 

// X21 = -L21 Xll + X21 

dgemm_("N", "N" , tX21m, tX21n , tL21n , tmone, L21 , IdL, Xll, ldX , tone, X21 , ldX) 

// X12 = -X10 U02 + X12 

dgemm_("N", "N" , tX12m, tX12n , tXlOn , tmone, X10 , ldX , U02 , ldU , tone, X12 , ldX) 

// X21 = sylv(L22, Ull, X21) 

sylv_(tX21m, tX21n , L22 , IdL, Ull, ldU , X21 , ldX , blocksize); 

// X12 = sylv(Lll, U22, X12) 

sylv_(tX12m, tX12n , Lll, IdL, U22 , ldU , X12 , ldX , blocksize); 

// X22 = -L21 X12 + X22 

dgemm_("N", "N" , tX22m, tX22n , tL21n , tmone, L21 , IdL, X12 , ldX , tone, X22 , ldX) 



Listing B.17: sylv.c (15 th algorithm updates) — Solution of Sylvester Equation. 



1045 


// Xll = -L10 XOl + Xll 






















1046 


dgemm ( "N" , "N" , tXllm, 


tXlln 


tLlOn , 


tmone , 


L10 , 


IdL , 


XOl , 


ldX , 


tone , 


Xll , 


ldX) ; 


1047 


// Xll = sylv (Lll , Ull , 


Xll) 




















1048 


sylv (tXllm, tXlln, Lll 


IdL , 


Ull, ldU, Xll, 


ldX , 


blocksize ) ; 








1049 


// X12 = -Xll U12 + X12 






















1050 


dgemm ( "N" , "N" , tX12m, 


tX12n 


tXlln, 


tmone , 


Xll , 


ldX , 


U12 , 


ldU , 


tone , 


X12, 


ldX) ; 


1051 


// X21 = -L21 Xll + X21 






















1052 


dgemm ( "N" , "N" , tX21m, 


tX21n 


tL21n , 


tmone , 


L21 , 


IdL , 


Xll , 


ldX , 


tone , 


X21 , 


ldX) ; 


1053 


// X12 = -L10 X02 + X12 






















1054 


dgemm ( "N" , "N" , tX12m, 


tX12n 


tLlOn , 


tmone , 


L10 , 


IdL , 


X02 , 


ldX , 


tone , 


X12, 


ldX) ; 


1055 


// X21 = -L20 XOl + X21 






















1056 


dgemm ( "N" , "N" , tX21m, 


tX21n 


tL20n , 


tmone , 


L20 , 


IdL , 


XOl , 


ldX , 


tone , 


X21 , 


ldX); 


1057 


// X12 = sylv (Lll , U22, 


X12) 
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loss sylv_(faX12m, &X12n , Lll , IdL , U22 , ldU , X12 , ldX , blocksize); 

1059 // X21 = sylv(L22, Ull , X21) 

loeo sylv_(faX21m, &X21n , L22 , IdL, Ull, ldU , X21 , ldX , blocksize); 

1061 // X22 = -X21 U12 + X22 

1062 dgemm_( "N" , "N" , &X22m, &X22n , &X21n , fanone, X21 , ldX , U12 , ldU , fame , X22 , ldX) ; 



Listing B.18: sylv.c (16 th algorithm updates) — Solution of Sylvester Equation. 

nog // Xll = sylv(Lll, Ull, Xll) 



mo sylv_ (&Allm, &jXlln, Lll, IdL 

mi // X12 = -Xll U12 + X12 

1112 dgemm_("N", "N" , &X12m, faX12i 

ins // X21 = -L21 Xll + X21 

iii4 dgemm_("N", "N" , &X21m, faX21i 

ins // X12 = sylv(Lll, U22, X12) 

me sylv_(faX12m, &X12n , Lll, IdL 

iii7 / X21 = sylv(L22, Ull, X21) 

ins sylv_(faX21m, &X21n , L22 , IdL 

mo -X21 U12 + X22 

1120 dgemm_( "N" , "N" , &X22m, faX22i 

1121 // X22 = -L21 X12 + X22 

1122 dgemm ( "N" , "N" , &X22m, &X22i 



ldX , 


blocksize ) ; 








Xll , 


ldX , U12 , ldU , 


fame , 


X12, 


ldX) ; 


L21 , 


IdL , Xll , ldX , 


fame , 


X21 , 


ldX) ; 


ldX , 


blocksize ) ; 








ldX , 


blocksize ) ; 








X21 , 


ldX , U12 , ldU , 


fame , 


X22, 


ldX) ; 


L21 , 


IdL , X12 , ldX , 


fame , 


X22, 


ldX) ; 



